1 00:00:00,000 --> 00:00:02,190 2 00:00:02,190 --> 00:00:04,640 The following content is provided under a Creative 3 00:00:04,640 --> 00:00:06,050 Commons license. 4 00:00:06,050 --> 00:00:09,090 Your support will help MIT OpenCourseWare continue to 5 00:00:09,090 --> 00:00:12,740 offer high quality educational resources for free. 6 00:00:12,740 --> 00:00:15,630 To make a donation or view additional materials from 7 00:00:15,630 --> 00:00:19,560 hundreds of MIT courses, visit MIT OpenCourseWare at 8 00:00:19,560 --> 00:00:20,810 ocw.mit.edu. 9 00:00:20,810 --> 00:00:31,380 10 00:00:31,380 --> 00:00:36,900 PROFESSOR: OK, so by this time, most of you should have 11 00:00:36,900 --> 00:00:40,790 had your meetings with your masters. 12 00:00:40,790 --> 00:00:44,650 How many of you had meetings with your masters? 13 00:00:44,650 --> 00:00:47,130 Who didn't yet? 14 00:00:47,130 --> 00:00:47,610 AUDIENCE: Today. 15 00:00:47,610 --> 00:00:48,370 PROFESSOR: Today? 16 00:00:48,370 --> 00:00:49,450 And Friday. 17 00:00:49,450 --> 00:00:53,060 Who hasn't scheduled a meeting? 18 00:00:53,060 --> 00:00:55,846 So are you talking to them, and do you know-- 19 00:00:55,846 --> 00:00:57,100 AUDIENCE: [INAUDIBLE] 20 00:00:57,100 --> 00:01:01,230 PROFESSOR: OK, make sure you get masters meetings scheduled 21 00:01:01,230 --> 00:01:03,130 because this is a very important part. 22 00:01:03,130 --> 00:01:04,890 We are getting feedback from the masters. 23 00:01:04,890 --> 00:01:08,820 I think they are giving you guys a lot of good advice, and 24 00:01:08,820 --> 00:01:14,900 so make sure you schedule, you go, you meet, get that result, 25 00:01:14,900 --> 00:01:15,590 use that result. 26 00:01:15,590 --> 00:01:19,870 It's very hard to arrange that kind of a senior person 27 00:01:19,870 --> 00:01:22,280 spending a lot of time with only two of 28 00:01:22,280 --> 00:01:23,520 you at a given time. 29 00:01:23,520 --> 00:01:25,940 That's a really good ratio to have, so take 30 00:01:25,940 --> 00:01:28,330 advantage of that. 31 00:01:28,330 --> 00:01:34,630 OK, so also, I think at this point, most of the beta issues 32 00:01:34,630 --> 00:01:40,275 for project one, we have been probably worked with, and they 33 00:01:40,275 --> 00:01:44,390 have some issues some people looked at that they are 34 00:01:44,390 --> 00:01:45,210 performance code. 35 00:01:45,210 --> 00:01:48,850 They were not happy because the thing is this. 36 00:01:48,850 --> 00:01:53,920 You can always run everything instantaneously if it doesn't 37 00:01:53,920 --> 00:01:56,480 have to produce a correct answer. 38 00:01:56,480 --> 00:02:00,580 You can just return, and so part of performance is also it 39 00:02:00,580 --> 00:02:03,590 has to have some correctness. 40 00:02:03,590 --> 00:02:05,760 So certain amount of correctness has to be passed 41 00:02:05,760 --> 00:02:07,270 before we can give performer grade. 42 00:02:07,270 --> 00:02:10,490 So some people have this question. 43 00:02:10,490 --> 00:02:12,850 Why are you giving me a correctness grade because it 44 00:02:12,850 --> 00:02:14,240 seemed to run? 45 00:02:14,240 --> 00:02:16,970 But if you haven't done the right checks and stuff like 46 00:02:16,970 --> 00:02:18,940 that, then it's unfair for other people because there's 47 00:02:18,940 --> 00:02:21,600 no way people who do the correct implementation can 48 00:02:21,600 --> 00:02:23,390 match that performance because we're using 49 00:02:23,390 --> 00:02:24,550 that as upper bound. 50 00:02:24,550 --> 00:02:26,110 So make sure you do that. 51 00:02:26,110 --> 00:02:30,660 And another issue people had was when you go to masters, 52 00:02:30,660 --> 00:02:34,730 you're going to work with one or two other students. 53 00:02:34,730 --> 00:02:38,470 And in your group, you're going to share all the 54 00:02:38,470 --> 00:02:42,260 material with your group member and nobody else. 55 00:02:42,260 --> 00:02:47,680 But between the beta and final, you get opportunity to 56 00:02:47,680 --> 00:02:52,460 basically get input from your masters, and also see what the 57 00:02:52,460 --> 00:02:55,720 other people who are coming to a project in your 58 00:02:55,720 --> 00:02:57,330 discussion have done. 59 00:02:57,330 --> 00:02:59,100 So it's OK for you to see those code. 60 00:02:59,100 --> 00:03:00,370 You can't take the code home. 61 00:03:00,370 --> 00:03:02,000 You can't just say give me a print-out, I am going to take 62 00:03:02,000 --> 00:03:02,860 home and re-implement. 63 00:03:02,860 --> 00:03:05,360 That doesn't count, but if you see some clever things they 64 00:03:05,360 --> 00:03:07,560 have done, it's all here to learn. 65 00:03:07,560 --> 00:03:10,240 Class is all the mostly about learning. 66 00:03:10,240 --> 00:03:12,650 However we have to grade you, so it's not all about grading, 67 00:03:12,650 --> 00:03:13,670 it's about learning. 68 00:03:13,670 --> 00:03:16,650 So we are trying to get a balance where we opportunity 69 00:03:16,650 --> 00:03:18,890 for you to learn from your peers. 70 00:03:18,890 --> 00:03:21,400 So if you go look at other people's code, if you find 71 00:03:21,400 --> 00:03:23,400 something cool they have done, learn. 72 00:03:23,400 --> 00:03:24,950 And also if you [UNINTELLIGIBLE] 73 00:03:24,950 --> 00:03:26,130 for you [UNINTELLIGIBLE] 74 00:03:26,130 --> 00:03:28,150 say, look, I did something cool. 75 00:03:28,150 --> 00:03:31,240 It's a good learning experience in there, so that's 76 00:03:31,240 --> 00:03:34,910 one opportunity you have to talk to other people on other 77 00:03:34,910 --> 00:03:37,850 groups, to basically learn something or find some 78 00:03:37,850 --> 00:03:40,510 interesting tidbits of how to get good performance. 79 00:03:40,510 --> 00:03:43,500 And hopefully, you learn some interesting things and able to 80 00:03:43,500 --> 00:03:45,770 implement that in your final. 81 00:03:45,770 --> 00:03:46,790 OK? 82 00:03:46,790 --> 00:03:51,370 So today we are going to talk about memory systems and 83 00:03:51,370 --> 00:03:53,900 performance engineering. 84 00:03:53,900 --> 00:04:00,640 So if you look at basic idea of what memory systems, you 85 00:04:00,640 --> 00:04:03,870 want to build a computer that seemed to have 86 00:04:03,870 --> 00:04:05,860 a really fast memory. 87 00:04:05,860 --> 00:04:07,750 You don't want things to be slow. 88 00:04:07,750 --> 00:04:10,110 For example, a long time ago, people build these computers 89 00:04:10,110 --> 00:04:13,590 where memory's always slow, everything [UNINTELLIGIBLE] 90 00:04:13,590 --> 00:04:14,350 further away. 91 00:04:14,350 --> 00:04:15,872 It doesn't help anybody. 92 00:04:15,872 --> 00:04:20,190 The way we know how to do that is to build small amount of 93 00:04:20,190 --> 00:04:23,460 memory very close to the processor. 94 00:04:23,460 --> 00:04:25,860 You can't build a huge amount of things all close to you. 95 00:04:25,860 --> 00:04:29,890 That doesn't work, doesn't scale like that. 96 00:04:29,890 --> 00:04:32,090 And we made a cache just like that, and you build larger and 97 00:04:32,090 --> 00:04:35,060 larger memory [UNINTELLIGIBLE] going down, and the illusion 98 00:04:35,060 --> 00:04:37,550 is you want to give the illusion of you have huge 99 00:04:37,550 --> 00:04:39,080 amount of memory. 100 00:04:39,080 --> 00:04:41,540 Everybody's very close to you. 101 00:04:41,540 --> 00:04:44,370 OK, so it looks like you have millions of friends, that you 102 00:04:44,370 --> 00:04:46,870 talk to everybody, and that doesn't happen. 103 00:04:46,870 --> 00:04:49,040 But how do you give that illusion? 104 00:04:49,040 --> 00:04:52,670 And the way that you give that illusion is when you use 105 00:04:52,670 --> 00:04:56,510 normal programming practice, when you normally using data, 106 00:04:56,510 --> 00:04:58,065 people have found there are two properties. 107 00:04:58,065 --> 00:05:01,820 108 00:05:01,820 --> 00:05:04,010 The first thing is called temporal locality. 109 00:05:04,010 --> 00:05:08,290 That means if I use some data item, there's a good chance I 110 00:05:08,290 --> 00:05:11,400 use that data item again very soon. 111 00:05:11,400 --> 00:05:16,150 There is something I haven't used for a long time, probably 112 00:05:16,150 --> 00:05:18,360 has a good chance I will not use it for a 113 00:05:18,360 --> 00:05:19,520 very, very long time. 114 00:05:19,520 --> 00:05:21,590 So can I take advantage of temporal locality? 115 00:05:21,590 --> 00:05:24,505 So that mean some data will be get for use a lot of time, in 116 00:05:24,505 --> 00:05:27,270 a very quick [UNINTELLIGIBLE], so those things should be very 117 00:05:27,270 --> 00:05:29,380 near to you because you might. 118 00:05:29,380 --> 00:05:31,060 Other one is spatial locality. 119 00:05:31,060 --> 00:05:33,640 That means if I use some data item, there's a good chance 120 00:05:33,640 --> 00:05:36,760 you'll be using data items next to that, closer to that. 121 00:05:36,760 --> 00:05:38,700 So can we take advantage of that? 122 00:05:38,700 --> 00:05:43,500 So those two properties help you help the compiler to fool 123 00:05:43,500 --> 00:05:46,940 the system, fool everybody, thinking that it has this huge 124 00:05:46,940 --> 00:05:50,310 amount of memory very close to you, but internally doesn't. 125 00:05:50,310 --> 00:05:53,690 Unfortunately, since it's trying to fool you, it doesn't 126 00:05:53,690 --> 00:05:56,110 work all the time, and when it doesn't work, it's good to 127 00:05:56,110 --> 00:06:01,000 recognize and able to fix those things. 128 00:06:01,000 --> 00:06:05,280 So I showed this picture sometime ago, too. 129 00:06:05,280 --> 00:06:08,440 Memories in a big hierarchy. 130 00:06:08,440 --> 00:06:12,630 L1, L2, in fact, we have L3 cache and a memory, disk, 131 00:06:12,630 --> 00:06:14,130 tapes, it can go up and down. 132 00:06:14,130 --> 00:06:16,662 And the key thing is when you go here, you're starting from 133 00:06:16,662 --> 00:06:18,630 registers, you have very small amount of 134 00:06:18,630 --> 00:06:20,570 things, very fast access. 135 00:06:20,570 --> 00:06:22,440 Here, we have almost infinite amount of 136 00:06:22,440 --> 00:06:24,290 things, very slow access. 137 00:06:24,290 --> 00:06:28,500 And the two reasons why this has to be smart because first 138 00:06:28,500 --> 00:06:30,510 of all, those things are very expensive. 139 00:06:30,510 --> 00:06:32,140 And second of course, you can't [UNINTELLIGIBLE] too 140 00:06:32,140 --> 00:06:34,680 much, and when you go down, things get somewhat cheaper. 141 00:06:34,680 --> 00:06:37,750 So that was one of the kind of economic incentive here. 142 00:06:37,750 --> 00:06:40,900 143 00:06:40,900 --> 00:06:43,690 So I talked about cache issues, and I'm going to go 144 00:06:43,690 --> 00:06:46,450 through this again because we are really going to dig deep 145 00:06:46,450 --> 00:06:47,730 into these things. 146 00:06:47,730 --> 00:06:50,430 So when you have a cache, that means you have a small memory 147 00:06:50,430 --> 00:06:53,790 close to you that's basically shadowing something sitting 148 00:06:53,790 --> 00:06:55,070 behind you. 149 00:06:55,070 --> 00:06:58,490 So when you ask for the data in the cache, there are many 150 00:06:58,490 --> 00:07:01,100 reasons why it's not there. 151 00:07:01,100 --> 00:07:03,090 First thing is cold miss. 152 00:07:03,090 --> 00:07:05,760 What that means is you're asking for something that you 153 00:07:05,760 --> 00:07:06,645 have never asked before. 154 00:07:06,645 --> 00:07:09,500 It's the first time you're asking, so the data is 155 00:07:09,500 --> 00:07:11,890 probably sitting way behind in your memory where, 156 00:07:11,890 --> 00:07:13,180 [UNINTELLIGIBLE] 157 00:07:13,180 --> 00:07:13,580 in a disk. 158 00:07:13,580 --> 00:07:15,330 We haven't even loaded it in there. 159 00:07:15,330 --> 00:07:21,780 And the first time you have to get it, so that get pulled in, 160 00:07:21,780 --> 00:07:26,020 and so this seems to be a place where there's nothing 161 00:07:26,020 --> 00:07:27,080 you can do. 162 00:07:27,080 --> 00:07:32,460 But another part of locality is a thing called prefetching. 163 00:07:32,460 --> 00:07:37,350 These modern processors have a crystal ball. 164 00:07:37,350 --> 00:07:39,780 They carry most of the things that branch prediction and 165 00:07:39,780 --> 00:07:43,980 stuff is trying to predict what you want to do. 166 00:07:43,980 --> 00:07:46,320 You can do the same thing in memory. 167 00:07:46,320 --> 00:07:48,550 You can look at what you have done. 168 00:07:48,550 --> 00:07:51,470 And Intel has this huge amount of circuitry that they don't 169 00:07:51,470 --> 00:07:54,380 tell anybody what they do, but they're trying to predict what 170 00:07:54,380 --> 00:07:57,060 you might fetch next, what data you might look for next. 171 00:07:57,060 --> 00:08:01,610 And if it is working well, something you had never, ever 172 00:08:01,610 --> 00:08:04,590 looked before, it might deduce that you might need it and go 173 00:08:04,590 --> 00:08:05,560 get it for you. 174 00:08:05,560 --> 00:08:10,130 So if that works, great. 175 00:08:10,130 --> 00:08:12,790 So then there's thing called capacity miss. 176 00:08:12,790 --> 00:08:18,240 What that means is as I said, I am trying to keep my best 177 00:08:18,240 --> 00:08:22,330 friends, or the people I will be working with a lot-- 178 00:08:22,330 --> 00:08:23,000 Oh, nice. 179 00:08:23,000 --> 00:08:26,240 What's going on here? 180 00:08:26,240 --> 00:08:28,040 Please start later. 181 00:08:28,040 --> 00:08:29,290 OK. 182 00:08:29,290 --> 00:08:32,169 183 00:08:32,169 --> 00:08:35,820 --the people I'm going to work a lot with, close to me. 184 00:08:35,820 --> 00:08:38,880 The problem is I can have only certain number of friends, and 185 00:08:38,880 --> 00:08:41,890 if you have more than that, I can't keep everybody. 186 00:08:41,890 --> 00:08:46,430 So hopefully, your program is going to use a lot what we 187 00:08:46,430 --> 00:08:49,400 call the working set that can fit in the cache. 188 00:08:49,400 --> 00:08:52,030 And at that point, you get all of those people, like I said, 189 00:08:52,030 --> 00:08:56,600 get everybody into one room, and the problem is if your 190 00:08:56,600 --> 00:08:58,150 party spills over from the room, then 191 00:08:58,150 --> 00:08:59,340 there's a lot of issues. 192 00:08:59,340 --> 00:09:01,570 But if everybody fits in the room, things are very nice. 193 00:09:01,570 --> 00:09:03,390 You can have a good interaction, continuous 194 00:09:03,390 --> 00:09:04,850 interaction, with them. 195 00:09:04,850 --> 00:09:10,900 And what happens is, if not, the worst case is the caches 196 00:09:10,900 --> 00:09:15,370 have eviction policy, normally is called least recently used. 197 00:09:15,370 --> 00:09:18,480 That means if I don't have room in the cache, I figure 198 00:09:18,480 --> 00:09:21,500 out the cache line that I haven't touched for longest, 199 00:09:21,500 --> 00:09:24,710 and I get rid of that, bring the next one. 200 00:09:24,710 --> 00:09:27,250 So that's a good policy. 201 00:09:27,250 --> 00:09:29,740 It fits in your locality thing, but when 202 00:09:29,740 --> 00:09:30,990 will did not work? 203 00:09:30,990 --> 00:09:33,620 204 00:09:33,620 --> 00:09:34,220 [UNINTELLIGIBLE] 205 00:09:34,220 --> 00:09:39,880 going to figure out when least recently used will create a 206 00:09:39,880 --> 00:09:42,257 really bad problem in capacity. 207 00:09:42,257 --> 00:09:43,507 AUDIENCE: [INAUDIBLE] 208 00:09:43,507 --> 00:09:45,410 209 00:09:45,410 --> 00:09:46,911 just one. 210 00:09:46,911 --> 00:09:51,060 Say you're [INAUDIBLE] set, and that set is, by one, 211 00:09:51,060 --> 00:09:52,890 bigger than the [INAUDIBLE] 212 00:09:52,890 --> 00:09:53,630 missing. 213 00:09:53,630 --> 00:09:54,205 PROFESSOR: Very good. 214 00:09:54,205 --> 00:09:56,900 So what you're going through, you're going round some data 215 00:09:56,900 --> 00:10:01,540 again and again, and that amount of data is one bigger 216 00:10:01,540 --> 00:10:03,210 than your cache line. 217 00:10:03,210 --> 00:10:04,670 You are going to have no locality because 218 00:10:04,670 --> 00:10:07,830 [UNINTELLIGIBLE], and when you go to, just before you use 219 00:10:07,830 --> 00:10:10,360 that data, they said, oops, I need to bring one more data. 220 00:10:10,360 --> 00:10:13,950 I need to evict something, and who am I going to evict? 221 00:10:13,950 --> 00:10:17,630 The guy I'm going to use next because it was the least 222 00:10:17,630 --> 00:10:18,340 recently used thing. 223 00:10:18,340 --> 00:10:21,100 So it goes, and then you go there, and you bring that. 224 00:10:21,100 --> 00:10:21,660 Who's it going to evict? 225 00:10:21,660 --> 00:10:23,580 You're going to evict the one I just brought. 226 00:10:23,580 --> 00:10:26,630 You're going to evict the next one that I am just about to 227 00:10:26,630 --> 00:10:28,020 use and bring that one. 228 00:10:28,020 --> 00:10:30,450 And you go use that, and then to use the next one, you're 229 00:10:30,450 --> 00:10:32,690 going evict the one you're just about to use. 230 00:10:32,690 --> 00:10:35,480 So by going and doing that, you basically get no locality. 231 00:10:35,480 --> 00:10:37,314 So that's a really bad problem in here. 232 00:10:37,314 --> 00:10:39,920 233 00:10:39,920 --> 00:10:43,370 So another interesting thing is called a conflict misses. 234 00:10:43,370 --> 00:10:46,920 Conflict misses says normally, when you have a cache-- 235 00:10:46,920 --> 00:10:50,360 so what we have is you have this large storage, which 236 00:10:50,360 --> 00:10:51,050 [UNINTELLIGIBLE] 237 00:10:51,050 --> 00:10:55,100 the memory, you're going to map into smaller storage in 238 00:10:55,100 --> 00:10:57,420 here, which is cache. 239 00:10:57,420 --> 00:11:01,620 So one way to say is any of this value line can be 240 00:11:01,620 --> 00:11:03,550 anywhere here. 241 00:11:03,550 --> 00:11:05,720 That's called fully associative cache. 242 00:11:05,720 --> 00:11:08,670 243 00:11:08,670 --> 00:11:12,000 Implementing that is somewhat complicated, so the other 244 00:11:12,000 --> 00:11:13,940 extreme is called direct map cache. 245 00:11:13,940 --> 00:11:17,080 What that means is this segment, the same size here, 246 00:11:17,080 --> 00:11:20,570 can map into here, and then the next segment also 247 00:11:20,570 --> 00:11:21,370 can map into here. 248 00:11:21,370 --> 00:11:25,490 That means if you get this cache line here, it can be 249 00:11:25,490 --> 00:11:30,200 only mapped into one place in this here. 250 00:11:30,200 --> 00:11:33,800 This cache line can only be here, and also this cache line 251 00:11:33,800 --> 00:11:37,620 can also only be here if it is the same offset. 252 00:11:37,620 --> 00:11:40,000 So that means for every cache line, there's only one 253 00:11:40,000 --> 00:11:41,980 location in the cache. 254 00:11:41,980 --> 00:11:45,550 So here, what would be really sad scenario? 255 00:11:45,550 --> 00:11:51,214 256 00:11:51,214 --> 00:11:53,580 AUDIENCE: [INAUDIBLE] 257 00:11:53,580 --> 00:11:56,006 PROFESSOR: Yeah, I mean I assume I have a 258 00:11:56,006 --> 00:11:57,010 lot of these slots. 259 00:11:57,010 --> 00:11:59,860 I read something here, and the next time I read something 260 00:11:59,860 --> 00:12:01,750 here, next time I read something here, and I go round 261 00:12:01,750 --> 00:12:03,350 and round this way. 262 00:12:03,350 --> 00:12:07,000 I might be touching only very few number of data items, but 263 00:12:07,000 --> 00:12:09,630 they also mapping into one cache line even though my 264 00:12:09,630 --> 00:12:13,320 working set is very small than the cache. 265 00:12:13,320 --> 00:12:18,340 I still don't have any cache usage 266 00:12:18,340 --> 00:12:20,840 because I am having conflicts. 267 00:12:20,840 --> 00:12:23,430 And then there are two other misses when you're going to 268 00:12:23,430 --> 00:12:24,220 multi-process. 269 00:12:24,220 --> 00:12:26,800 We'll go through them in later lectures. 270 00:12:26,800 --> 00:12:28,270 One is called true sharing. 271 00:12:28,270 --> 00:12:33,560 That means there are caches private to each core, and what 272 00:12:33,560 --> 00:12:35,780 happens is if one core uses some data, you've have to get 273 00:12:35,780 --> 00:12:37,330 the cache line there. 274 00:12:37,330 --> 00:12:39,100 If the other core wants to use that data, you have to bring 275 00:12:39,100 --> 00:12:42,220 the cache line back there, and the cache lines can ping-pong 276 00:12:42,220 --> 00:12:43,940 [UNINTELLIGIBLE] that. 277 00:12:43,940 --> 00:12:46,090 False sharing is even a worst way. 278 00:12:46,090 --> 00:12:50,700 So what happens is if the first processor here touches 279 00:12:50,700 --> 00:12:53,910 this value of the cache line, second processor touches this 280 00:12:53,910 --> 00:12:55,841 value of the cache line. 281 00:12:55,841 --> 00:12:59,320 We're not sharing anything, but unfortunately, the two 282 00:12:59,320 --> 00:13:01,920 things I'm using sits next to each other, so when I ride 283 00:13:01,920 --> 00:13:04,500 this thing, I get the cache line, he doesn't. 284 00:13:04,500 --> 00:13:05,660 When he need to ride this thing, he has to 285 00:13:05,660 --> 00:13:06,510 get the cache line. 286 00:13:06,510 --> 00:13:10,860 So I am seemingly using independent data, but I am 287 00:13:10,860 --> 00:13:13,720 bouncing cache line in between, and if that happened, 288 00:13:13,720 --> 00:13:16,880 that going to have a huge big performance impact. 289 00:13:16,880 --> 00:13:17,610 So these other things. 290 00:13:17,610 --> 00:13:20,850 The last two, we'll get to a little bit later. 291 00:13:20,850 --> 00:13:26,010 So today I am going to start with modeling how caches work 292 00:13:26,010 --> 00:13:28,180 with a very simplistic cache. 293 00:13:28,180 --> 00:13:30,840 So here's, assume, my cache. 294 00:13:30,840 --> 00:13:32,730 I have 32 kilobytes in here. 295 00:13:32,730 --> 00:13:35,300 296 00:13:35,300 --> 00:13:36,640 This is a direct map cache. 297 00:13:36,640 --> 00:13:41,600 That means that in my memory, 32 kilobyte chucks get mapped 298 00:13:41,600 --> 00:13:42,900 to direct [UNINTELLIGIBLE]. 299 00:13:42,900 --> 00:13:45,060 This 32 kilobyte get mapped here. 300 00:13:45,060 --> 00:13:47,970 That 32 kilobyte get mapped here. 301 00:13:47,970 --> 00:13:57,890 And the cache line inside is 64 bytes, 64, and that means I 302 00:13:57,890 --> 00:14:00,780 have basically [UNINTELLIGIBLE PHRASE]. 303 00:14:00,780 --> 00:14:03,280 304 00:14:03,280 --> 00:14:04,590 32 kilobyte in here. 305 00:14:04,590 --> 00:14:05,700 OK? 306 00:14:05,700 --> 00:14:12,770 So just remember this as we go on doing this, we will use 307 00:14:12,770 --> 00:14:13,530 this as a formula. 308 00:14:13,530 --> 00:14:14,130 OK? 309 00:14:14,130 --> 00:14:16,340 And we assume if you getting the cache, [UNINTELLIGIBLE] 310 00:14:16,340 --> 00:14:19,400 single cycle, if you miss 100 cycles. 311 00:14:19,400 --> 00:14:22,210 OK, so it's nice numbers to do that. 312 00:14:22,210 --> 00:14:28,630 And so the first thing is we have a reference like this. 313 00:14:28,630 --> 00:14:29,210 OK? 314 00:14:29,210 --> 00:14:35,820 You go from [i] equals 0 to [i] less than very large 315 00:14:35,820 --> 00:14:38,640 number, A[i]. 316 00:14:38,640 --> 00:14:41,600 I just go to accessing memory like one after another after 317 00:14:41,600 --> 00:14:42,850 another after another. 318 00:14:42,850 --> 00:14:46,610 319 00:14:46,610 --> 00:14:49,070 So you see how this is going on in here? 320 00:14:49,070 --> 00:14:51,580 So also I am accessing integers, so that means four 321 00:14:51,580 --> 00:14:54,460 bytes at a time in cache line. 322 00:14:54,460 --> 00:14:55,200 So what happens here? 323 00:14:55,200 --> 00:14:58,800 So I assume size of int is four, so you're getting four 324 00:14:58,800 --> 00:15:01,490 bytes in here. 325 00:15:01,490 --> 00:15:03,610 So yeah, you're doing s reads to A. I'm accessing 326 00:15:03,610 --> 00:15:06,240 s elements in here. 327 00:15:06,240 --> 00:15:10,490 And you have 16 elements of a per cache line. 328 00:15:10,490 --> 00:15:12,220 OK, you see why 16 if there? 329 00:15:12,220 --> 00:15:19,010 Because I have 64 in here, 64 bytes in here. 330 00:15:19,010 --> 00:15:22,340 Each time axises four bytes, so I have 16 of integers in 331 00:15:22,340 --> 00:15:24,748 cache line. 332 00:15:24,748 --> 00:15:26,134 AUDIENCE: [INAUDIBLE] 333 00:15:26,134 --> 00:15:27,520 bits or bytes? 334 00:15:27,520 --> 00:15:28,330 PROFESSOR: Bytes. 335 00:15:28,330 --> 00:15:29,010 Byte. 336 00:15:29,010 --> 00:15:29,775 Did I say bits? 337 00:15:29,775 --> 00:15:32,690 No, bytes. 338 00:15:32,690 --> 00:15:37,322 So what happens is if it axises the same element, 15 of 339 00:15:37,322 --> 00:15:40,680 every 16 is in the cache because when we are going one 340 00:15:40,680 --> 00:15:44,230 after the first guy, cache miss, I had to go bring it, 341 00:15:44,230 --> 00:15:46,240 and the next 15, I have brought to it. 342 00:15:46,240 --> 00:15:47,360 It's in the cache. 343 00:15:47,360 --> 00:15:51,170 So what should be my cost? 344 00:15:51,170 --> 00:15:51,880 [UNINTELLIGIBLE] 345 00:15:51,880 --> 00:15:53,880 cost of memory access. 346 00:15:53,880 --> 00:15:57,840 347 00:15:57,840 --> 00:16:03,715 I am accessing s data, and 1/16th of that is a cache 348 00:16:03,715 --> 00:16:07,320 miss, 15/16th is a cache hit. 349 00:16:07,320 --> 00:16:09,320 So basically total [UNINTELLIGIBLE] 350 00:16:09,320 --> 00:16:15,970 is a 15/16th of s is a cache hit, and other 1/16th I have 351 00:16:15,970 --> 00:16:21,970 100 times cycles I need because it's a cache miss. 352 00:16:21,970 --> 00:16:24,210 Everybody good so far? 353 00:16:24,210 --> 00:16:27,110 OK, so what type of locality do we have here? 354 00:16:27,110 --> 00:16:29,760 355 00:16:29,760 --> 00:16:31,090 First of all, do we have locality? 356 00:16:31,090 --> 00:16:33,870 357 00:16:33,870 --> 00:16:36,231 Who thinks we have locality? 358 00:16:36,231 --> 00:16:38,480 OK, lot of people thinks we have locality. 359 00:16:38,480 --> 00:16:40,400 What type? 360 00:16:40,400 --> 00:16:41,370 AUDIENCE: [INAUDIBLE] 361 00:16:41,370 --> 00:16:44,090 PROFESSOR: Spatial locality because what we are doing is 362 00:16:44,090 --> 00:16:47,030 we are accessing the nearby elements even though we are 363 00:16:47,030 --> 00:16:49,870 not getting back any of the data yet, so we get some 364 00:16:49,870 --> 00:16:51,810 spacial locality axis here. 365 00:16:51,810 --> 00:16:55,150 That's good, so this is our very -- most simple 366 00:16:55,150 --> 00:16:57,220 thing we can do. 367 00:16:57,220 --> 00:16:59,300 So what kind of misses are we going to have in the cache? 368 00:16:59,300 --> 00:17:06,119 369 00:17:06,119 --> 00:17:07,869 Cold misses. 370 00:17:07,869 --> 00:17:09,859 I'm missing because I have never seen that data before, 371 00:17:09,859 --> 00:17:13,280 so it's a cold miss in here. 372 00:17:13,280 --> 00:17:15,550 OK, so that's that. 373 00:17:15,550 --> 00:17:17,700 Let's look at this one. 374 00:17:17,700 --> 00:17:18,950 I'm accessing A[0] 375 00:17:18,950 --> 00:17:21,780 376 00:17:21,780 --> 00:17:23,030 s times. 377 00:17:23,030 --> 00:17:25,690 378 00:17:25,690 --> 00:17:27,400 OK? 379 00:17:27,400 --> 00:17:28,852 What's the total access time? 380 00:17:28,852 --> 00:17:40,950 381 00:17:40,950 --> 00:17:44,360 How many cache misses am I going to get? 382 00:17:44,360 --> 00:17:50,780 One, so I basically have 100 time for the first cache miss, 383 00:17:50,780 --> 00:17:53,740 and s minus 1 for the rest basically it's a hit. 384 00:17:53,740 --> 00:17:55,080 OK? 385 00:17:55,080 --> 00:17:56,760 That's good, so what kind of locality's this? 386 00:17:56,760 --> 00:18:02,070 387 00:18:02,070 --> 00:18:03,030 Spatial or temporal. 388 00:18:03,030 --> 00:18:05,490 There's not too many choices. 389 00:18:05,490 --> 00:18:06,740 Or none. 390 00:18:06,740 --> 00:18:09,140 391 00:18:09,140 --> 00:18:10,390 How many people think spatial? 392 00:18:10,390 --> 00:18:12,850 393 00:18:12,850 --> 00:18:15,770 How many people think temporal? 394 00:18:15,770 --> 00:18:19,230 How many people said there's none locality? 395 00:18:19,230 --> 00:18:20,990 OK, that's a lot of temporal. 396 00:18:20,990 --> 00:18:24,210 OK, you want to get [UNINTELLIGIBLE]. 397 00:18:24,210 --> 00:18:26,060 That's temporal locality here because you're accessing to 398 00:18:26,060 --> 00:18:28,650 the same thing again and again and again, same data. 399 00:18:28,650 --> 00:18:28,910 So this is-- 400 00:18:28,910 --> 00:18:30,056 Oh, OK. 401 00:18:30,056 --> 00:18:32,350 Want to restart, OK. 402 00:18:32,350 --> 00:18:34,826 Wait a little. 403 00:18:34,826 --> 00:18:36,810 So this is only time we notice in a hurry. 404 00:18:36,810 --> 00:18:38,840 Every time, I go to do something, I get hourglass. 405 00:18:38,840 --> 00:18:41,630 406 00:18:41,630 --> 00:18:45,190 So what kind of misses are we getting? 407 00:18:45,190 --> 00:18:46,440 Trick question. 408 00:18:46,440 --> 00:18:54,150 409 00:18:54,150 --> 00:18:57,510 I got one cold miss, and that's about it. 410 00:18:57,510 --> 00:18:58,840 And the rest I don't have any misses. 411 00:18:58,840 --> 00:19:01,410 So if I have a miss, it's a cold miss. 412 00:19:01,410 --> 00:19:05,260 OK, so here I am doing something interesting. 413 00:19:05,260 --> 00:19:08,720 So what the heck is in this list? 414 00:19:08,720 --> 00:19:10,840 So I'm accessing A[i] 415 00:19:10,840 --> 00:19:11,840 [UNINTELLIGIBLE] 416 00:19:11,840 --> 00:19:14,860 this 2 to the power number, one shifted to the entire 417 00:19:14,860 --> 00:19:21,470 miss, 2 to the power N. I shifted between 4 and 13. 418 00:19:21,470 --> 00:19:22,500 What this 13? 419 00:19:22,500 --> 00:19:24,080 Why is 13? 420 00:19:24,080 --> 00:19:25,330 Less than 13. 421 00:19:25,330 --> 00:19:29,590 422 00:19:29,590 --> 00:19:31,060 What's 2 the power of 13? 423 00:19:31,060 --> 00:19:32,310 I think I got this right. 424 00:19:32,310 --> 00:19:37,625 425 00:19:37,625 --> 00:19:43,070 AUDIENCE: 8,192. 426 00:19:43,070 --> 00:19:44,370 PROFESSOR: What's 2 to the power of 13? 427 00:19:44,370 --> 00:19:45,270 AUDIENCE: 8,192. 428 00:19:45,270 --> 00:19:46,890 PROFESSOR: Yeah, 8K. 429 00:19:46,890 --> 00:19:50,330 And I have 32K. 430 00:19:50,330 --> 00:19:54,290 So 8K of 32 [UNINTELLIGIBLE]. 431 00:19:54,290 --> 00:19:59,680 8K of integers, which is each as 4 bytes is how much? 432 00:19:59,680 --> 00:20:00,140 32K. 433 00:20:00,140 --> 00:20:05,410 So what this says is everything I access should fit 434 00:20:05,410 --> 00:20:05,900 into the cache. 435 00:20:05,900 --> 00:20:07,080 I am accessing data like that. 436 00:20:07,080 --> 00:20:08,680 I'm going back and accessing data like that. 437 00:20:08,680 --> 00:20:11,520 I'm going back and accessing data like that, and it should 438 00:20:11,520 --> 00:20:12,700 all fit into the cache. 439 00:20:12,700 --> 00:20:14,270 Do you see what I do? 440 00:20:14,270 --> 00:20:18,370 I access up to 2 to the power of 18, and I go back and 441 00:20:18,370 --> 00:20:19,060 access that again. 442 00:20:19,060 --> 00:20:20,640 I'm just going back and back because 443 00:20:20,640 --> 00:20:23,150 of the model operation. 444 00:20:23,150 --> 00:20:25,670 Everybody see what's going on? 445 00:20:25,670 --> 00:20:27,740 So how many cache misses should I have? 446 00:20:27,740 --> 00:20:40,916 447 00:20:40,916 --> 00:20:44,870 AUDIENCE: There are four cache [INAUDIBLE]. 448 00:20:44,870 --> 00:20:48,090 PROFESSOR: OK, how many cache lines I would be accessing if 449 00:20:48,090 --> 00:20:52,120 I am doing i1 2 to the power N? 450 00:20:52,120 --> 00:20:54,875 451 00:20:54,875 --> 00:20:57,070 [i] 452 00:20:57,070 --> 00:21:00,974 mod 2 to the power N. 453 00:21:00,974 --> 00:21:02,224 AUDIENCE: [INAUDIBLE] 454 00:21:02,224 --> 00:21:17,720 455 00:21:17,720 --> 00:21:20,240 PROFESSOR: So one miss for each axis line 456 00:21:20,240 --> 00:21:21,070 the first time around. 457 00:21:21,070 --> 00:21:23,290 Afterward, it's only in the cache. 458 00:21:23,290 --> 00:21:28,430 So how many axis lines? 459 00:21:28,430 --> 00:21:30,000 2 to the power N. 460 00:21:30,000 --> 00:21:30,770 Is that right? 461 00:21:30,770 --> 00:21:32,020 I think this is wrong. 462 00:21:32,020 --> 00:21:36,087 463 00:21:36,087 --> 00:21:40,510 Oh, yeah, this is right because every axis-- 464 00:21:40,510 --> 00:21:44,676 So in the cache line, how many-- 465 00:21:44,676 --> 00:21:45,926 AUDIENCE: [INAUDIBLE] 466 00:21:45,926 --> 00:21:48,150 467 00:21:48,150 --> 00:21:49,400 PROFESSOR: This is four. 468 00:21:49,400 --> 00:21:51,730 469 00:21:51,730 --> 00:21:54,440 So there are 16 entries here. 470 00:21:54,440 --> 00:21:57,760 471 00:21:57,760 --> 00:21:59,780 16 entries in here. 472 00:21:59,780 --> 00:22:05,470 OK, only one of them basically misses that, so that means if 473 00:22:05,470 --> 00:22:09,224 2 to the power N axises-- 474 00:22:09,224 --> 00:22:11,090 AUDIENCE: [INAUDIBLE] 475 00:22:11,090 --> 00:22:17,680 PROFESSOR: You are doing 2 to the power N axises before you 476 00:22:17,680 --> 00:22:18,720 go back in here. 477 00:22:18,720 --> 00:22:22,260 So what you're doing is you are-- 478 00:22:22,260 --> 00:22:27,320 you make this many axises before you go back again. 479 00:22:27,320 --> 00:22:28,910 OK? 480 00:22:28,910 --> 00:22:34,130 And how many axises assigned to be in the same cache line? 481 00:22:34,130 --> 00:22:37,490 You have 64 bytes in the cache. 482 00:22:37,490 --> 00:22:41,730 OK each axis is four. 483 00:22:41,730 --> 00:22:43,134 16. 484 00:22:43,134 --> 00:22:46,040 AUDIENCE: [INAUDIBLE] 485 00:22:46,040 --> 00:22:48,210 PROFESSOR: 16. 486 00:22:48,210 --> 00:22:49,430 So that's true [UNINTELLIGIBLE]. 487 00:22:49,430 --> 00:22:50,680 AUDIENCE: [INAUDIBLE] 488 00:22:50,680 --> 00:22:53,410 489 00:22:53,410 --> 00:22:56,510 PROFESSOR: Yeah, but what I'm saying is I might not access 490 00:22:56,510 --> 00:22:57,250 the [UNINTELLIGIBLE] 491 00:22:57,250 --> 00:22:58,890 because if any small. 492 00:22:58,890 --> 00:23:00,760 I only access a certain amount of cache line, so I'm 493 00:23:00,760 --> 00:23:02,760 accessing a part of the cache. 494 00:23:02,760 --> 00:23:04,520 I might not go through the-- can everybody see this, how 495 00:23:04,520 --> 00:23:06,160 this is working? 496 00:23:06,160 --> 00:23:12,170 Because if I'm only accessing, we'll say, accessing one, if 497 00:23:12,170 --> 00:23:16,472 this is, we'll say, 2 to the power of 5. 498 00:23:16,472 --> 00:23:18,570 OK, I am not going to access the entire cache line. 499 00:23:18,570 --> 00:23:18,990 [UNINTELLIGIBLE] 500 00:23:18,990 --> 00:23:22,100 cache is I'm going to [UNINTELLIGIBLE] 501 00:23:22,100 --> 00:23:23,220 go through the cache. 502 00:23:23,220 --> 00:23:27,670 So this many cache lines I'm accessing. 503 00:23:27,670 --> 00:23:31,665 OK, and the first time I access that, I get a cache 504 00:23:31,665 --> 00:23:34,680 miss, and after that, it say everything is in the cache. 505 00:23:34,680 --> 00:23:38,820 506 00:23:38,820 --> 00:23:39,460 Everybody's following me? 507 00:23:39,460 --> 00:23:41,650 Or are we like lost in here? 508 00:23:41,650 --> 00:23:44,540 How many people are lost? 509 00:23:44,540 --> 00:23:46,260 OK, let's do this. 510 00:23:46,260 --> 00:23:49,430 511 00:23:49,430 --> 00:23:57,120 So what happens is if you have a modular 2 to the power N. 512 00:23:57,120 --> 00:24:03,390 What that means is I'm going to keep accessing one to 513 00:24:03,390 --> 00:24:07,900 somewhere 2 to the power N, and the next one, I am going 514 00:24:07,900 --> 00:24:09,150 back here again. 515 00:24:09,150 --> 00:24:12,800 516 00:24:12,800 --> 00:24:14,580 That's my axises, basically, because it goes 517 00:24:14,580 --> 00:24:16,620 [UNINTELLIGIBLE] 518 00:24:16,620 --> 00:24:18,380 accessing to the power of N. 519 00:24:18,380 --> 00:24:23,840 I nicely said n is between 4 and 13. 520 00:24:23,840 --> 00:24:38,820 So n is 13 means the maximum I can do is 8K. 521 00:24:38,820 --> 00:24:40,070 8K increase. 522 00:24:40,070 --> 00:24:42,340 523 00:24:42,340 --> 00:24:51,470 8K increase equals 32K kilobytes of memory. 524 00:24:51,470 --> 00:24:54,540 525 00:24:54,540 --> 00:24:56,580 So that means you don't have any option of 526 00:24:56,580 --> 00:24:59,660 overwriting the cache. 527 00:24:59,660 --> 00:25:02,790 So you go to cache, [UNINTELLIGIBLE]. 528 00:25:02,790 --> 00:25:04,880 You don't wrap around in the cache. 529 00:25:04,880 --> 00:25:06,325 Do you see that? 530 00:25:06,325 --> 00:25:09,450 OK, you know wrap arounding is so bad, that means all these 531 00:25:09,450 --> 00:25:12,110 things is going to fit in the cache, and then when you get 532 00:25:12,110 --> 00:25:14,860 [UNINTELLIGIBLE], this is going to be in the cache 533 00:25:14,860 --> 00:25:22,630 because the data you access is smaller than the entire cache. 534 00:25:22,630 --> 00:25:25,760 OK, so what that means is only the first line has to have 535 00:25:25,760 --> 00:25:28,410 some cache misses. 536 00:25:28,410 --> 00:25:29,965 So how many cache misses you are going to have 537 00:25:29,965 --> 00:25:30,470 in the first line? 538 00:25:30,470 --> 00:25:32,790 Or any of the lines going to have cache misses because the 539 00:25:32,790 --> 00:25:34,430 first line still fit in the cache. 540 00:25:34,430 --> 00:25:34,970 [UNINTELLIGIBLE] 541 00:25:34,970 --> 00:25:35,280 anything. 542 00:25:35,280 --> 00:25:36,540 We don't need everything. 543 00:25:36,540 --> 00:25:38,410 How many cache misses have in here? 544 00:25:38,410 --> 00:25:45,410 The interesting thing is because these are four bytes, 545 00:25:45,410 --> 00:25:55,830 we have a 64-byte cache line, and each is four bytes. 546 00:25:55,830 --> 00:25:58,390 547 00:25:58,390 --> 00:26:03,980 So that means I can do 16 axises every time I have a 548 00:26:03,980 --> 00:26:06,550 cache miss. 549 00:26:06,550 --> 00:26:13,280 OK, so up to here to here is 2 to the N. My cache misses I do 550 00:26:13,280 --> 00:26:13,960 [UNINTELLIGIBLE] 551 00:26:13,960 --> 00:26:14,690 basically. 552 00:26:14,690 --> 00:26:18,110 My cache misses is basically this much. 553 00:26:18,110 --> 00:26:21,460 554 00:26:21,460 --> 00:26:23,280 Everything else is in the cache, and after the first 555 00:26:23,280 --> 00:26:24,810 line, everything else is in the cache again, 556 00:26:24,810 --> 00:26:27,030 [UNINTELLIGIBLE] nicely, got my working set 557 00:26:27,030 --> 00:26:28,140 to fit in the cache. 558 00:26:28,140 --> 00:26:30,180 I'm really happy. 559 00:26:30,180 --> 00:26:31,580 Everybody see this? 560 00:26:31,580 --> 00:26:34,530 How many people don't see it now? 561 00:26:34,530 --> 00:26:36,010 OK, good. 562 00:26:36,010 --> 00:26:39,380 So let's move to something a little bit more complicated. 563 00:26:39,380 --> 00:26:40,630 What kind of locality do we have? 564 00:26:40,630 --> 00:26:45,698 565 00:26:45,698 --> 00:26:47,570 AUDIENCE: Temporal and spatial. 566 00:26:47,570 --> 00:26:47,830 PROFESSOR: Good. 567 00:26:47,830 --> 00:26:51,560 We have temporal and spatial. 568 00:26:51,560 --> 00:26:55,090 You have spatial because every time you go, you only get the 569 00:26:55,090 --> 00:26:56,060 first line of the cache. 570 00:26:56,060 --> 00:26:58,850 Rest is in the cache. 571 00:26:58,850 --> 00:27:01,720 From that, I got a 16x improvement because I have 572 00:27:01,720 --> 00:27:02,990 temporal locality. 573 00:27:02,990 --> 00:27:06,690 And since I only access when I go back to the data, after 574 00:27:06,690 --> 00:27:12,660 accessing this, since it's already in the cache, I get 575 00:27:12,660 --> 00:27:14,800 spatial locality because of the 16. 576 00:27:14,800 --> 00:27:16,600 16 things are in the cache, I am going through that. 577 00:27:16,600 --> 00:27:18,050 I get spatial locality. 578 00:27:18,050 --> 00:27:22,530 I get temporal locality because next time I access, 579 00:27:22,530 --> 00:27:23,580 it's still in the cache. 580 00:27:23,580 --> 00:27:26,380 I haven't taken anything out of the cache. 581 00:27:26,380 --> 00:27:27,930 OK, what kind of misses? 582 00:27:27,930 --> 00:27:30,960 583 00:27:30,960 --> 00:27:33,950 Should be cold misses again. 584 00:27:33,950 --> 00:27:35,650 Cold misses because the first time I [UNINTELLIGIBLE] 585 00:27:35,650 --> 00:27:35,890 things. 586 00:27:35,890 --> 00:27:36,860 I haven't seen that data. 587 00:27:36,860 --> 00:27:41,460 After I get it, everything is in the cache. 588 00:27:41,460 --> 00:27:44,700 So now, here's interesting case. 589 00:27:44,700 --> 00:27:49,010 Now, I am doing 2 to the power N, where N is great than 14. 590 00:27:49,010 --> 00:27:52,800 591 00:27:52,800 --> 00:27:55,120 Now, what happens? 592 00:27:55,120 --> 00:28:00,920 Now, what happens in our picture here is I am going-- 593 00:28:00,920 --> 00:28:05,180 594 00:28:05,180 --> 00:28:08,220 So as you might cache is somewhere here. 595 00:28:08,220 --> 00:28:13,470 At this point, I fill out the cache. 596 00:28:13,470 --> 00:28:18,010 I still keep going, and minute I access something 597 00:28:18,010 --> 00:28:19,300 [UNINTELLIGIBLE] 598 00:28:19,300 --> 00:28:20,550 cache, what's going to happen? 599 00:28:20,550 --> 00:28:24,630 600 00:28:24,630 --> 00:28:28,650 So I went through accessing 34 kilobyte. 601 00:28:28,650 --> 00:28:30,570 If I access the next one, what happens? 602 00:28:30,570 --> 00:28:34,420 603 00:28:34,420 --> 00:28:38,130 OK, no, no, no shutting down, please. 604 00:28:38,130 --> 00:28:39,380 OK, what happened? 605 00:28:39,380 --> 00:28:42,730 606 00:28:42,730 --> 00:28:45,940 So I am accessing more than I can fit into the cache. 607 00:28:45,940 --> 00:28:48,840 Next time access, what's going to happen? 608 00:28:48,840 --> 00:28:51,638 Anyone want to take a wild guess? 609 00:28:51,638 --> 00:28:52,888 AUDIENCE: [INAUDIBLE] 610 00:28:52,888 --> 00:28:55,530 611 00:28:55,530 --> 00:28:57,450 PROFESSOR: [UNINTELLIGIBLE PHRASE] 612 00:28:57,450 --> 00:28:57,840 before that. 613 00:28:57,840 --> 00:28:58,440 [UNINTELLIGIBLE] 614 00:28:58,440 --> 00:28:59,260 in there. 615 00:28:59,260 --> 00:29:04,410 So what happens is then am I ever going to have any 616 00:29:04,410 --> 00:29:07,120 temporal locality? 617 00:29:07,120 --> 00:29:08,330 No. 618 00:29:08,330 --> 00:29:14,590 OK, so now, first access to each line basically misses 619 00:29:14,590 --> 00:29:18,720 forever because by the time I get back to the 620 00:29:18,720 --> 00:29:19,880 [UNINTELLIGIBLE], it's not there. 621 00:29:19,880 --> 00:29:21,720 So basically, it's gone out of the cache. 622 00:29:21,720 --> 00:29:26,060 So what that means is my total access time is basically every 623 00:29:26,060 --> 00:29:30,720 16th element, I am getting cache miss forever. 624 00:29:30,720 --> 00:29:33,890 So what that means, every 15th of 16, I have a hit. 625 00:29:33,890 --> 00:29:40,070 So what I have is 15 out of 16, I have a cache hit. 626 00:29:40,070 --> 00:29:44,440 1/16th of the time, I have a cache miss forever. 627 00:29:44,440 --> 00:29:45,930 So what kind of locality do I have now? 628 00:29:45,930 --> 00:29:50,367 629 00:29:50,367 --> 00:29:52,250 I have spatial locality. 630 00:29:52,250 --> 00:29:53,520 I don't have any temporal locality. 631 00:29:53,520 --> 00:29:55,910 I don't ever go back to something in the cache again. 632 00:29:55,910 --> 00:29:58,280 So what I have is spatial locality, so what type of 633 00:29:58,280 --> 00:29:59,530 misses now do I have? 634 00:29:59,530 --> 00:30:03,679 635 00:30:03,679 --> 00:30:06,090 AUDIENCE: [INAUDIBLE] 636 00:30:06,090 --> 00:30:08,112 PROFESSOR: Cold, right. 637 00:30:08,112 --> 00:30:09,438 AUDIENCE: Is it like shared misses? 638 00:30:09,438 --> 00:30:09,880 [INAUDIBLE] 639 00:30:09,880 --> 00:30:10,770 PROFESSOR: It's not shared. 640 00:30:10,770 --> 00:30:12,930 When you fill out the cache, you go back to 641 00:30:12,930 --> 00:30:14,080 [UNINTELLIGIBLE]. 642 00:30:14,080 --> 00:30:15,690 You fill the cache. 643 00:30:15,690 --> 00:30:17,828 So what type of miss is that? 644 00:30:17,828 --> 00:30:18,716 AUDIENCE: Capacity? 645 00:30:18,716 --> 00:30:19,770 PROFESSOR: Capacity miss. 646 00:30:19,770 --> 00:30:22,910 OK, so you have basically cold and capacity 647 00:30:22,910 --> 00:30:25,360 misses happening now. 648 00:30:25,360 --> 00:30:25,695 OK? 649 00:30:25,695 --> 00:30:26,518 AUDIENCE: One question. 650 00:30:26,518 --> 00:30:27,982 [INAUDIBLE] 651 00:30:27,982 --> 00:30:31,642 that you multiplied whenever you're trying to load from 652 00:30:31,642 --> 00:30:35,302 memory, is that like an arbitrary number, or is that 653 00:30:35,302 --> 00:30:36,766 the actual cost of going-- 654 00:30:36,766 --> 00:30:39,520 PROFESSOR: Every machine, you can get this beautiful table. 655 00:30:39,520 --> 00:30:42,860 I will go off what your machines have. 656 00:30:42,860 --> 00:30:45,020 We still have a number saying this is your miss 657 00:30:45,020 --> 00:30:46,130 [UNINTELLIGIBLE], this is the thing-- 658 00:30:46,130 --> 00:30:50,120 I mean, so some of these things you realize because of 659 00:30:50,120 --> 00:30:50,780 all of the complexity. 660 00:30:50,780 --> 00:30:53,590 It's not that nice and simple, but this, I just pull out of 661 00:30:53,590 --> 00:30:55,570 hat, and it's kind of normal. 662 00:30:55,570 --> 00:30:59,860 663 00:30:59,860 --> 00:31:02,110 So what do I do, now? 664 00:31:02,110 --> 00:31:06,765 I am doing the mod, but I am multiplying [i] by 16. 665 00:31:06,765 --> 00:31:10,150 666 00:31:10,150 --> 00:31:14,450 So now, I am accessing this value, this value, this value, 667 00:31:14,450 --> 00:31:15,550 this value in here. 668 00:31:15,550 --> 00:31:17,590 I'm accessing this value, this value, this value, this value. 669 00:31:17,590 --> 00:31:22,490 One value in each cache line. 670 00:31:22,490 --> 00:31:23,440 OK? 671 00:31:23,440 --> 00:31:25,530 How much cache misses do you think I'm going to get? 672 00:31:25,530 --> 00:31:29,895 673 00:31:29,895 --> 00:31:31,145 AUDIENCE: [INAUDIBLE] 674 00:31:31,145 --> 00:31:33,300 675 00:31:33,300 --> 00:31:37,250 PROFESSOR: 100 for every access is basically beginning 676 00:31:37,250 --> 00:31:41,980 a cache line, you're taking one value and if I go back to 677 00:31:41,980 --> 00:31:45,230 that value, I have already filled up the cache. 678 00:31:45,230 --> 00:31:49,370 So basically, I have first access in the cache. 679 00:31:49,370 --> 00:31:53,700 And what's your total access time? 680 00:31:53,700 --> 00:31:54,790 Anyone will take a guess? 681 00:31:54,790 --> 00:31:55,996 AUDIENCE: [INAUDIBLE] 682 00:31:55,996 --> 00:31:58,650 PROFESSOR: Yep, 100 times x because I have no-- 683 00:31:58,650 --> 00:32:00,470 OK, so this is a [UNINTELLIGIBLE] 684 00:32:00,470 --> 00:32:01,090 locality. 685 00:32:01,090 --> 00:32:03,150 That was clear. 686 00:32:03,150 --> 00:32:05,143 What kind of misses am I getting now? 687 00:32:05,143 --> 00:32:06,130 AUDIENCE: [INAUDIBLE] 688 00:32:06,130 --> 00:32:07,390 PROFESSOR: Now, I'm getting conflict misses. 689 00:32:07,390 --> 00:32:09,420 I get a cold miss at the beginning, first time around, 690 00:32:09,420 --> 00:32:11,040 and then every time I do something, I'm getting 691 00:32:11,040 --> 00:32:15,870 conflict miss because the thing is, now, this N, I am 692 00:32:15,870 --> 00:32:18,340 only accessing small amount of data. 693 00:32:18,340 --> 00:32:23,090 It doesn't fit in the cache, but I'm not still getting any 694 00:32:23,090 --> 00:32:26,055 kind of sharing, so I'm having conflict misses. 695 00:32:26,055 --> 00:32:30,910 696 00:32:30,910 --> 00:32:32,540 [UNINTELLIGIBLE] second time, I guess I will 697 00:32:32,540 --> 00:32:33,500 probably jump over this. 698 00:32:33,500 --> 00:32:35,980 I just did OK, if I go random access, what happens? 699 00:32:35,980 --> 00:32:37,120 So let me jump over that. 700 00:32:37,120 --> 00:32:39,600 Actually can do a calculation to figure out how many, 701 00:32:39,600 --> 00:32:41,230 probabilistically, how much you might have 702 00:32:41,230 --> 00:32:41,860 and stuff like that. 703 00:32:41,860 --> 00:32:43,990 OK. 704 00:32:43,990 --> 00:32:48,790 So now, if you look at what's going on, when you have no 705 00:32:48,790 --> 00:32:51,280 locality, you'd have no locality. 706 00:32:51,280 --> 00:32:53,410 That's a pretty obvious statement. 707 00:32:53,410 --> 00:32:57,470 And then if you have spatial locality and no temporal 708 00:32:57,470 --> 00:32:59,560 locality, we are streaming data. 709 00:32:59,560 --> 00:33:02,150 That means you are going through the data, but you're 710 00:33:02,150 --> 00:33:04,910 not getting back to it fast enough, so I 711 00:33:04,910 --> 00:33:05,810 have temporal locality. 712 00:33:05,810 --> 00:33:09,060 So I stream through data, so that means I go through data 713 00:33:09,060 --> 00:33:11,170 in a streaming fashion. 714 00:33:11,170 --> 00:33:12,420 OK? 715 00:33:12,420 --> 00:33:16,820 716 00:33:16,820 --> 00:33:21,530 So what we have is if working set fits in cache, you have 717 00:33:21,530 --> 00:33:22,640 InCache mode. 718 00:33:22,640 --> 00:33:25,920 If working set is too big for cache, you can get some 719 00:33:25,920 --> 00:33:28,550 streaming mode and still get some locality because we are 720 00:33:28,550 --> 00:33:30,060 bringing the lines in here. 721 00:33:30,060 --> 00:33:34,800 And you can do other things like optimizers like 722 00:33:34,800 --> 00:33:37,230 prefetching we'll get to. 723 00:33:37,230 --> 00:33:40,060 And other issues [UNINTELLIGIBLE] 724 00:33:40,060 --> 00:33:46,130 this last one, but to deal with cache axises. 725 00:33:46,130 --> 00:33:50,820 So if you have more than one axis-- so here what I have is 726 00:33:50,820 --> 00:33:53,040 too nice arrays. 727 00:33:53,040 --> 00:33:57,860 [UNINTELLIGIBLE] is 2 to the power size array 2 to the 19 728 00:33:57,860 --> 00:33:59,430 power arrays. 729 00:33:59,430 --> 00:34:02,940 OK, so now what happens when I put this in the cache? 730 00:34:02,940 --> 00:34:04,370 Look what happens in here. 731 00:34:04,370 --> 00:34:10,080 So what happens is this array get mapped in this one. 732 00:34:10,080 --> 00:34:13,969 This array get maps this nicely in here. 733 00:34:13,969 --> 00:34:17,580 So what happens is this line can only be in this cache 734 00:34:17,580 --> 00:34:20,820 line, and this line also only can be in 735 00:34:20,820 --> 00:34:24,310 this line of the cache. 736 00:34:24,310 --> 00:34:25,969 Do you see what's going on, now? 737 00:34:25,969 --> 00:34:30,739 So if you do this one, basically every time you 738 00:34:30,739 --> 00:34:32,194 access something-- 739 00:34:32,194 --> 00:34:36,130 See, assume I'm accessing this data item, so I'm doing A, B. 740 00:34:36,130 --> 00:34:38,370 I access A, I get this line. 741 00:34:38,370 --> 00:34:40,317 I access B, what happens? 742 00:34:40,317 --> 00:34:41,510 AUDIENCE: [INAUDIBLE] 743 00:34:41,510 --> 00:34:42,550 PROFESSOR: So this [UNINTELLIGIBLE] is gone. 744 00:34:42,550 --> 00:34:42,920 It is gone. 745 00:34:42,920 --> 00:34:44,600 Now, I am accessing the next one in that 746 00:34:44,600 --> 00:34:47,829 cache line we made. 747 00:34:47,829 --> 00:34:50,757 AUDIENCE: How do you know what cache line-- or how 748 00:34:50,757 --> 00:34:51,733 [INAUDIBLE] 749 00:34:51,733 --> 00:34:52,709 cache? 750 00:34:52,709 --> 00:34:53,699 [INAUDIBLE] 751 00:34:53,699 --> 00:34:57,200 PROFESSOR: Normally, in language like C, if people two 752 00:34:57,200 --> 00:35:00,080 arrays next to each other, they should be mapped next to 753 00:35:00,080 --> 00:35:01,830 each other in memory. 754 00:35:01,830 --> 00:35:04,290 And if you want to be more adventurous, you can go to the 755 00:35:04,290 --> 00:35:08,070 assembly and then see what their locations is. 756 00:35:08,070 --> 00:35:10,650 So if you put A, A next to each other, you normally kind 757 00:35:10,650 --> 00:35:13,000 of get it next to each other. 758 00:35:13,000 --> 00:35:15,540 So what this one is this is pretty bad. 759 00:35:15,540 --> 00:35:19,450 I should have gotten at least some spatial locality. 760 00:35:19,450 --> 00:35:20,585 I'm getting nothing. 761 00:35:20,585 --> 00:35:22,000 Do you see what's going on here? 762 00:35:22,000 --> 00:35:26,140 763 00:35:26,140 --> 00:35:28,035 Do you know what's going on here? 764 00:35:28,035 --> 00:35:30,770 Know why I'm not getting any spatial locality? 765 00:35:30,770 --> 00:35:31,940 Because I am bouncing. 766 00:35:31,940 --> 00:35:34,410 When I access this one, this one is gone even though I have 767 00:35:34,410 --> 00:35:36,960 more data in the same cache line, so I'm kind of bouncing 768 00:35:36,960 --> 00:35:38,390 two cache lines because of that. 769 00:35:38,390 --> 00:35:40,600 What's a good solution for this? 770 00:35:40,600 --> 00:35:41,550 Back there. 771 00:35:41,550 --> 00:35:42,500 AUDIENCE: [INAUDIBLE] 772 00:35:42,500 --> 00:35:47,120 A and B could be mapped to the same cache line. 773 00:35:47,120 --> 00:35:49,330 PROFESSOR: The thing A is a nice 2 to the power. 774 00:35:49,330 --> 00:35:54,422 A size is a multiple of the cache size. 775 00:35:54,422 --> 00:35:54,860 AUDIENCE: [INAUDIBLE] 776 00:35:54,860 --> 00:35:59,140 PROFESSOR: The size of A is a multiple of the cache size. 777 00:35:59,140 --> 00:36:04,480 So what happens is if you have memory in here, you map 778 00:36:04,480 --> 00:36:06,775 something like A, C here. 779 00:36:06,775 --> 00:36:13,960 If this is a multiple of cache line, so assume this is 32K 780 00:36:13,960 --> 00:36:18,450 times some number, and you map B here, and then this is 781 00:36:18,450 --> 00:36:22,590 normally what happens in C. You put all adjacent map 782 00:36:22,590 --> 00:36:25,150 allocations next to each other in memory. 783 00:36:25,150 --> 00:36:27,830 Do you got A here, and you start having B in 784 00:36:27,830 --> 00:36:30,130 here, next to it. 785 00:36:30,130 --> 00:36:34,840 OK, so what happens is now if A starts to assume address, 786 00:36:34,840 --> 00:36:47,710 we'll say 100, this is 100 plus 32K times N, basically. 787 00:36:47,710 --> 00:36:55,680 And you do this mod 32K, this mod 32K, this same number. 788 00:36:55,680 --> 00:36:58,280 It maps into the same line. 789 00:36:58,280 --> 00:37:01,740 OK, so I kind of gave you the problem. 790 00:37:01,740 --> 00:37:03,752 What can be the solution? 791 00:37:03,752 --> 00:37:05,135 AUDIENCE: [INAUDIBLE] 792 00:37:05,135 --> 00:37:06,980 allocate one line. 793 00:37:06,980 --> 00:37:08,280 PROFESSOR: Yes, it's called padding. 794 00:37:08,280 --> 00:37:12,530 795 00:37:12,530 --> 00:37:14,950 OK, so I have no locality in here, let me go to the next 796 00:37:14,950 --> 00:37:15,920 slide, what kind of misses? 797 00:37:15,920 --> 00:37:17,500 I have cold and conflict. 798 00:37:17,500 --> 00:37:19,710 What I can do is just basically add a 799 00:37:19,710 --> 00:37:21,180 little bit of padding. 800 00:37:21,180 --> 00:37:22,320 I did 16 here. 801 00:37:22,320 --> 00:37:26,400 Normally, you do a prime number so it will not clash 802 00:37:26,400 --> 00:37:26,910 with anything. 803 00:37:26,910 --> 00:37:31,350 Add a little bit of padding at the end of array, and that 804 00:37:31,350 --> 00:37:33,230 means the next array will not be 805 00:37:33,230 --> 00:37:35,310 conflicted most of the time. 806 00:37:35,310 --> 00:37:37,710 So a lot of times when you declare things next to each 807 00:37:37,710 --> 00:37:39,630 other, it's always good to add a little bit of 808 00:37:39,630 --> 00:37:41,450 a padding in between. 809 00:37:41,450 --> 00:37:46,010 Normally, a smaller prime number would be a good thing 810 00:37:46,010 --> 00:37:47,260 to add a padding. 811 00:37:47,260 --> 00:37:50,620 812 00:37:50,620 --> 00:37:54,180 Now, I start getting back my nice locality because the two 813 00:37:54,180 --> 00:37:57,460 things are mapping the two cache lines. 814 00:37:57,460 --> 00:37:58,920 What type of locality do I get here? 815 00:37:58,920 --> 00:38:05,960 816 00:38:05,960 --> 00:38:07,470 Anybody want to take any guess what type of 817 00:38:07,470 --> 00:38:08,720 locality I have here? 818 00:38:08,720 --> 00:38:11,816 819 00:38:11,816 --> 00:38:13,220 AUDIENCE: Question, sorry. 820 00:38:13,220 --> 00:38:14,954 [INAUDIBLE] 821 00:38:14,954 --> 00:38:16,942 you actually added just an extra 16. 822 00:38:16,942 --> 00:38:17,439 PROFESSOR: Yes. 823 00:38:17,439 --> 00:38:21,420 AUDIENCE: Why would that just make the A and B [INAUDIBLE]? 824 00:38:21,420 --> 00:38:24,640 PROFESSOR: Because what happens is normally it doesn't 825 00:38:24,640 --> 00:38:29,830 make A and B into leaving the cache, but it makes A[0] 826 00:38:29,830 --> 00:38:33,800 not exactly matching to B[0]. 827 00:38:33,800 --> 00:38:35,200 A[0] 828 00:38:35,200 --> 00:38:39,670 will interleave with something like B[16] 829 00:38:39,670 --> 00:38:40,670 or something. 830 00:38:40,670 --> 00:38:43,232 This thing might map into the same place. 831 00:38:43,232 --> 00:38:47,480 AUDIENCE: I mean, but I [INAUDIBLE]. 832 00:38:47,480 --> 00:38:49,220 PROFESSOR: Yes, normally in a computation, 833 00:38:49,220 --> 00:38:50,120 people do that, A[i] 834 00:38:50,120 --> 00:38:50,670 equals B[i] 835 00:38:50,670 --> 00:38:52,246 plus something. 836 00:38:52,246 --> 00:38:53,725 AUDIENCE: Oh, OK, so [INAUDIBLE]. 837 00:38:53,725 --> 00:38:57,669 838 00:38:57,669 --> 00:38:59,312 Like I understand what you're saying, but I don't understand 839 00:38:59,312 --> 00:39:00,627 how that [INAUDIBLE] 840 00:39:00,627 --> 00:39:03,585 because they're both accessing [i] at the same time. 841 00:39:03,585 --> 00:39:04,835 Why would they [INAUDIBLE]? 842 00:39:04,835 --> 00:39:09,980 843 00:39:09,980 --> 00:39:13,290 PROFESSOR: So here's the problem. 844 00:39:13,290 --> 00:39:14,840 Assume I have [UNINTELLIGIBLE] 845 00:39:14,840 --> 00:39:16,710 32K. 846 00:39:16,710 --> 00:39:18,180 I'm assuming [UNINTELLIGIBLE]. 847 00:39:18,180 --> 00:39:25,600 OK, so first one is, we'll say, you start with 100, 100 848 00:39:25,600 --> 00:39:36,400 plus [i], and the other one is 100 plus 32K plus [i]. 849 00:39:36,400 --> 00:39:46,250 OK, then you take a mod 32K, and this end up 850 00:39:46,250 --> 00:39:50,740 being 100 plus [i] 851 00:39:50,740 --> 00:39:53,110 with another 100 plus [i]. 852 00:39:53,110 --> 00:39:55,760 So this is basically [UNINTELLIGIBLE]. 853 00:39:55,760 --> 00:39:58,950 You still have one or one element in the cache it's all 854 00:39:58,950 --> 00:40:01,890 going to map into. 855 00:40:01,890 --> 00:40:03,390 You see that? 856 00:40:03,390 --> 00:40:11,556 But now, if I add 16 more to this one, then this map into 857 00:40:11,556 --> 00:40:15,816 116 plus [i]. 858 00:40:15,816 --> 00:40:18,286 AUDIENCE: I don't see that reflected in the code. 859 00:40:18,286 --> 00:40:21,250 You just [INAUDIBLE], so how is B different 860 00:40:21,250 --> 00:40:22,240 from A in your code? 861 00:40:22,240 --> 00:40:24,450 PROFESSOR: Because [UNINTELLIGIBLE] my S. I added 862 00:40:24,450 --> 00:40:28,488 16 to S, so my A is a little bit bigger, now. 863 00:40:28,488 --> 00:40:30,280 AUDIENCE: But isn't B also [INAUDIBLE]? 864 00:40:30,280 --> 00:40:32,240 PROFESSOR: Yeah, B's also bigger here. 865 00:40:32,240 --> 00:40:32,870 I don't care. 866 00:40:32,870 --> 00:40:34,110 B goes down here. 867 00:40:34,110 --> 00:40:37,780 B, I add a little bit more to the B. It doesn't matter. 868 00:40:37,780 --> 00:40:40,010 I padded B, too. 869 00:40:40,010 --> 00:40:41,790 So what that means is the first A[i] 870 00:40:41,790 --> 00:40:45,282 and B[i] are not in the same cache line. 871 00:40:45,282 --> 00:40:47,505 AUDIENCE: How do we allocate A and B right 872 00:40:47,505 --> 00:40:48,740 next to each other? 873 00:40:48,740 --> 00:40:57,270 PROFESSOR: So normally in C, if you declare something like 874 00:40:57,270 --> 00:41:08,010 int A [100], int B [100], more or less, they will be 875 00:41:08,010 --> 00:41:09,700 allocated next to each other. 876 00:41:09,700 --> 00:41:12,620 If you look at the assembly listing, it'll say code 877 00:41:12,620 --> 00:41:15,640 allocated next to each other in memory. 878 00:41:15,640 --> 00:41:19,480 Even in the stack, if you do that, if you allocate things, 879 00:41:19,480 --> 00:41:24,265 compiler has no incentive to go and move things around. 880 00:41:24,265 --> 00:41:27,850 AUDIENCE: This is only for global and [INAUDIBLE]? 881 00:41:27,850 --> 00:41:30,172 PROFESSOR: Global variable, local variables. 882 00:41:30,172 --> 00:41:32,860 AUDIENCE: Does any of it still apply if you use malloc? 883 00:41:32,860 --> 00:41:38,520 884 00:41:38,520 --> 00:41:39,770 PROFESSOR: The thing about malloc is-- 885 00:41:39,770 --> 00:41:43,164 886 00:41:43,164 --> 00:41:44,150 AUDIENCE: [INAUDIBLE] 887 00:41:44,150 --> 00:41:45,612 PROFESSOR: It might do something [UNINTELLIGIBLE], 888 00:41:45,612 --> 00:41:48,760 but on the other hand, if it is fitting on the same page, 889 00:41:48,760 --> 00:41:50,460 it might also do something like that, too. 890 00:41:50,460 --> 00:41:53,120 So it depends on how we do that. 891 00:41:53,120 --> 00:41:56,400 At the beginning, if you keep mallocing things, it might be 892 00:41:56,400 --> 00:41:56,870 [UNINTELLIGIBLE]. 893 00:41:56,870 --> 00:42:00,660 They might be a little bit off because malloc put some 894 00:42:00,660 --> 00:42:02,920 metadata with each status. 895 00:42:02,920 --> 00:42:04,290 If you ask for [UNINTELLIGIBLE], you're not 896 00:42:04,290 --> 00:42:05,260 getting exactly 100. 897 00:42:05,260 --> 00:42:06,740 You're getting a little bit of a bigger size. 898 00:42:06,740 --> 00:42:09,930 But you might have some configuring to do, but after 899 00:42:09,930 --> 00:42:12,710 some time, when you have done a lot of [UNINTELLIGIBLE], we 900 00:42:12,710 --> 00:42:13,810 have no idea where it's going to go. 901 00:42:13,810 --> 00:42:15,290 It's random. 902 00:42:15,290 --> 00:42:17,530 Anymore questions? 903 00:42:17,530 --> 00:42:20,940 OK, good, that is a good question. 904 00:42:20,940 --> 00:42:22,790 So what kind of locality do I have here? 905 00:42:22,790 --> 00:42:29,030 906 00:42:29,030 --> 00:42:29,860 [UNINTELLIGIBLE] 907 00:42:29,860 --> 00:42:37,130 two lines, I am accessing each, and I got 16 times 15 908 00:42:37,130 --> 00:42:38,380 out of 16 hits. 909 00:42:38,380 --> 00:42:41,570 910 00:42:41,570 --> 00:42:42,650 What's that? 911 00:42:42,650 --> 00:42:43,540 AUDIENCE: Spatial? 912 00:42:43,540 --> 00:42:45,400 PROFESSOR: Spatial locality. 913 00:42:45,400 --> 00:42:47,990 And of course, if you have spatial locality, you get cold 914 00:42:47,990 --> 00:42:48,850 misses, basically. 915 00:42:48,850 --> 00:42:51,170 And I think [UNINTELLIGIBLE], you get what 916 00:42:51,170 --> 00:42:54,605 other type of misses? 917 00:42:54,605 --> 00:42:56,569 AUDIENCE: [INAUDIBLE] 918 00:42:56,569 --> 00:42:57,551 PROFESSOR: [UNINTELLIGIBLE] conflict here. 919 00:42:57,551 --> 00:42:58,540 Basically, capacity. 920 00:42:58,540 --> 00:43:01,260 Capacity miss, exactly. 921 00:43:01,260 --> 00:43:08,260 So in order to avoid this, what we have done is we make 922 00:43:08,260 --> 00:43:10,940 cache-- instead of mapping into one place before, and 923 00:43:10,940 --> 00:43:14,800 what we said was we had this memory, and 924 00:43:14,800 --> 00:43:16,160 assume we have 32 sets. 925 00:43:16,160 --> 00:43:20,270 And if you have a cache like this, we make this entire map 926 00:43:20,270 --> 00:43:24,100 into this one, and this map into this one. 927 00:43:24,100 --> 00:43:27,250 What people have done is instead of doing that, what 928 00:43:27,250 --> 00:43:30,500 you can make is make the cache, instead of one big 929 00:43:30,500 --> 00:43:34,436 block, two small blocks. 930 00:43:34,436 --> 00:43:40,890 So what that means is this can happen to this one. 931 00:43:40,890 --> 00:43:43,670 932 00:43:43,670 --> 00:43:46,150 My drawings are not that great, but I 933 00:43:46,150 --> 00:43:47,000 hope you get the picture. 934 00:43:47,000 --> 00:43:50,310 This map into this one, so what that means is each 935 00:43:50,310 --> 00:43:55,630 reference here, so maps here, and these 936 00:43:55,630 --> 00:43:56,940 two basically conflicts. 937 00:43:56,940 --> 00:43:58,690 But now with that [UNINTELLIGIBLE] 938 00:43:58,690 --> 00:44:01,610 too, I can put this one here or here. 939 00:44:01,610 --> 00:44:05,030 I have two places to put the value, basically, and that 940 00:44:05,030 --> 00:44:07,480 [UNINTELLIGIBLE]. 941 00:44:07,480 --> 00:44:11,260 OK, so that means if I access something conflicting, 942 00:44:11,260 --> 00:44:14,960 [UNINTELLIGIBLE], you cannot just two things. 943 00:44:14,960 --> 00:44:17,860 Still, there are places in the cache for them to go. 944 00:44:17,860 --> 00:44:20,510 And one way to do people is fully associative cache. 945 00:44:20,510 --> 00:44:23,550 It can go anywhere, but that's very hard to do in hardware, 946 00:44:23,550 --> 00:44:26,560 but you can do some small number of ways. 947 00:44:26,560 --> 00:44:30,480 So in here, if you do two basic associative cache, in 948 00:44:30,480 --> 00:44:35,610 the first axis in here, even though these are conflicted, 949 00:44:35,610 --> 00:44:37,090 on can go to one [UNINTELLIGIBLE] set, another 950 00:44:37,090 --> 00:44:38,340 one can go to the other set. 951 00:44:38,340 --> 00:44:41,070 952 00:44:41,070 --> 00:44:44,640 OK, so two associative cache [UNINTELLIGIBLE]. 953 00:44:44,640 --> 00:44:48,780 So total access time, again, I have nice locality in here, 954 00:44:48,780 --> 00:44:53,570 spatial locality, and I have both cold and, I guess, 955 00:44:53,570 --> 00:44:56,490 capacity misses. 956 00:44:56,490 --> 00:44:58,740 But if you have two associative cache, what's the 957 00:44:58,740 --> 00:44:59,890 next problem? 958 00:44:59,890 --> 00:45:01,310 How about if you have three different things? 959 00:45:01,310 --> 00:45:04,280 960 00:45:04,280 --> 00:45:06,070 So we have two different things like that. 961 00:45:06,070 --> 00:45:07,730 OK, now that I only have two associated [UNINTELLIGIBLE], 962 00:45:07,730 --> 00:45:10,630 it doesn't work because, in a [UNINTELLIGIBLE], I will 963 00:45:10,630 --> 00:45:13,290 replace the last thing, and you might still end up with 964 00:45:13,290 --> 00:45:15,830 nothing in here because you are replacing nothing, and the 965 00:45:15,830 --> 00:45:17,290 next thing replaced the other thing. 966 00:45:17,290 --> 00:45:20,450 And basically after you gave B, C, then you go 967 00:45:20,450 --> 00:45:21,450 to A. A Is not there. 968 00:45:21,450 --> 00:45:23,040 When you go to B, B is not there, C is not there. 969 00:45:23,040 --> 00:45:26,560 You kind of go back to the same problem in here, and then 970 00:45:26,560 --> 00:45:28,285 basically have to access everything again. 971 00:45:28,285 --> 00:45:33,570 972 00:45:33,570 --> 00:45:35,750 There is some spatial locality but cache can't use it. 973 00:45:35,750 --> 00:45:39,190 Basically, you have conflict misses. 974 00:45:39,190 --> 00:45:40,640 OK, so you see that? 975 00:45:40,640 --> 00:45:44,410 So associative is good because normally people assume you 976 00:45:44,410 --> 00:45:46,080 don't need that much 977 00:45:46,080 --> 00:45:49,590 associativity, but it's limited. 978 00:45:49,590 --> 00:45:51,430 So if you have too many things, you might run into 979 00:45:51,430 --> 00:45:54,416 this problem again. 980 00:45:54,416 --> 00:45:55,666 AUDIENCE: [INAUDIBLE] 981 00:45:55,666 --> 00:46:02,320 982 00:46:02,320 --> 00:46:03,308 PROFESSOR: You mean capacity? 983 00:46:03,308 --> 00:46:04,790 AUDIENCE: Yeah, [INAUDIBLE] capacity. 984 00:46:04,790 --> 00:46:05,778 PROFESSOR: Less. 985 00:46:05,778 --> 00:46:07,754 AUDIENCE: [INAUDIBLE] 986 00:46:07,754 --> 00:46:10,718 elements in a cache before you pick [INAUDIBLE] because it 987 00:46:10,718 --> 00:46:14,670 looks as though the matrix is is using half 988 00:46:14,670 --> 00:46:15,658 the cache by itself. 989 00:46:15,658 --> 00:46:17,510 PROFESSOR: Because it's associated, you mean? 990 00:46:17,510 --> 00:46:18,480 You have multiple there? 991 00:46:18,480 --> 00:46:21,030 I mean, the thing is since it's associated, no, because 992 00:46:21,030 --> 00:46:22,050 you have enough room. 993 00:46:22,050 --> 00:46:23,260 I mean, if you're going [UNINTELLIGIBLE] 994 00:46:23,260 --> 00:46:25,360 everything, you have two places to put the 995 00:46:25,360 --> 00:46:28,730 name because it maps. 996 00:46:28,730 --> 00:46:29,950 That's a good question. 997 00:46:29,950 --> 00:46:35,600 So what happens is now, assume this many things fit into the 998 00:46:35,600 --> 00:46:37,310 cache before. 999 00:46:37,310 --> 00:46:40,610 Now, cache has two banks, and in each bank, only this many 1000 00:46:40,610 --> 00:46:42,095 things fit in, but the next thing can go 1001 00:46:42,095 --> 00:46:43,080 to the other bank. 1002 00:46:43,080 --> 00:46:44,550 So if you go through [UNINTELLIGIBLE], you have the 1003 00:46:44,550 --> 00:46:47,620 same behavior even if you have associativity. 1004 00:46:47,620 --> 00:46:48,890 OK, you can go and-- 1005 00:46:48,890 --> 00:46:52,240 because if the cache is the same size, the size is what 1006 00:46:52,240 --> 00:46:54,767 matters for if you are going through [UNINTELLIGIBLE] axis. 1007 00:46:54,767 --> 00:46:56,552 AUDIENCE: So far, [INAUDIBLE] machines that we're working 1008 00:46:56,552 --> 00:47:00,773 on, do they have a fixed kind of cache, or 1009 00:47:00,773 --> 00:47:01,585 can we decide like-- 1010 00:47:01,585 --> 00:47:02,940 PROFESSOR: The fixed kind of cache. 1011 00:47:02,940 --> 00:47:04,800 It's built into the hardware. 1012 00:47:04,800 --> 00:47:06,650 I will show the table to say what type of 1013 00:47:06,650 --> 00:47:09,346 associativity is in there. 1014 00:47:09,346 --> 00:47:12,818 AUDIENCE: Is there not a way to say [INAUDIBLE] 1015 00:47:12,818 --> 00:47:14,306 three-way associative? 1016 00:47:14,306 --> 00:47:16,786 Can you have something on top of it so that it looks like 1017 00:47:16,786 --> 00:47:18,770 it's two-way associative or something? 1018 00:47:18,770 --> 00:47:22,790 PROFESSOR: I mean, you can do things like that in software, 1019 00:47:22,790 --> 00:47:23,810 but they're very expensive. 1020 00:47:23,810 --> 00:47:26,960 I mean, think about cache is trying to fool the cache in 1021 00:47:26,960 --> 00:47:30,710 that direction cannot be that easy because they are like 1022 00:47:30,710 --> 00:47:32,470 really hard-baked into hardware. that's a good 1023 00:47:32,470 --> 00:47:35,650 question, but there are things you can do, not exactly 1024 00:47:35,650 --> 00:47:38,690 modeling something, but kind of putting offsets and stuff 1025 00:47:38,690 --> 00:47:41,800 like that so it doesn't get into these kind of conflicts. 1026 00:47:41,800 --> 00:47:46,140 1027 00:47:46,140 --> 00:47:47,520 So yeah, I will answer this thing. 1028 00:47:47,520 --> 00:47:48,340 Why don't you have more? 1029 00:47:48,340 --> 00:47:51,950 Because if you go back to your [UNINTELLIGIBLE] 1030 00:47:51,950 --> 00:47:52,530 lectures. 1031 00:47:52,530 --> 00:47:54,720 Perhaps it might have explanation in it's hardware 1032 00:47:54,720 --> 00:47:56,590 why it's much harder to build. 1033 00:47:56,590 --> 00:48:00,730 And if you have the linked lists now-- 1034 00:48:00,730 --> 00:48:06,674 so assume I have linked lists we are going through in here. 1035 00:48:06,674 --> 00:48:12,450 And what's my [UNINTELLIGIBLE]? 1036 00:48:12,450 --> 00:48:15,220 Basically, depending on allocation and using some 1037 00:48:15,220 --> 00:48:17,980 things, so best case is everything is in cache, so I 1038 00:48:17,980 --> 00:48:19,400 have a small list. 1039 00:48:19,400 --> 00:48:22,760 And I go through again and again, round and around my 1040 00:48:22,760 --> 00:48:26,280 list, everything's in cache and [UNINTELLIGIBLE] 1041 00:48:26,280 --> 00:48:29,350 S. That's really great. 1042 00:48:29,350 --> 00:48:32,130 Next best is everything is adjacent. 1043 00:48:32,130 --> 00:48:34,830 OK, everything might not be in the cache, but everything at 1044 00:48:34,830 --> 00:48:35,760 least in adjacent memory. 1045 00:48:35,760 --> 00:48:40,120 So basically, first time I allocate my malloc did a nice 1046 00:48:40,120 --> 00:48:45,480 thing for me, and at that time, I basically get a nice 1047 00:48:45,480 --> 00:48:49,510 basic spatial locality since my size is 16 in here. 1048 00:48:49,510 --> 00:48:51,180 So I get this much. 1049 00:48:51,180 --> 00:48:53,390 And the worst case is random. 1050 00:48:53,390 --> 00:48:55,320 Malloc puts it all over the place. 1051 00:48:55,320 --> 00:48:55,660 Question? 1052 00:48:55,660 --> 00:48:57,127 AUDIENCE: [INAUDIBLE] 1053 00:48:57,127 --> 00:49:01,283 cache, does the thing behind it [UNINTELLIGIBLE] in front 1054 00:49:01,283 --> 00:49:02,506 of it [UNINTELLIGIBLE] also going to? 1055 00:49:02,506 --> 00:49:06,390 Do those both have a spatial locality? 1056 00:49:06,390 --> 00:49:16,290 PROFESSOR: What happens is normally, we assume 64 element 1057 00:49:16,290 --> 00:49:24,873 cache, so it's at that boundary, so basically 2 to 1058 00:49:24,873 --> 00:49:26,640 the 8 boundaries is what happens. 1059 00:49:26,640 --> 00:49:33,830 So what happens is cache line begins at 0 address, 64 1060 00:49:33,830 --> 00:49:34,310 address here. 1061 00:49:34,310 --> 00:49:38,570 So if you get 63, that means this value is not in the 1062 00:49:38,570 --> 00:49:42,120 cache, so you get this cache line. 1063 00:49:42,120 --> 00:49:43,680 So the cache line doesn't shift. 1064 00:49:43,680 --> 00:49:48,840 It's fixed chunks in here, so that if you get address 0, you 1065 00:49:48,840 --> 00:49:50,630 get this one, and all these things. 1066 00:49:50,630 --> 00:49:52,580 If you get something in the middle, you'll get the left 1067 00:49:52,580 --> 00:49:54,300 and right [UNINTELLIGIBLE], but if you get here, you only 1068 00:49:54,300 --> 00:49:55,790 get the right. 1069 00:49:55,790 --> 00:49:58,550 So this basically 0 and 64, they are boundaries 1070 00:49:58,550 --> 00:49:58,820 [INAUDIBLE]. 1071 00:49:58,820 --> 00:50:00,430 If you like 65, that's it. 1072 00:50:00,430 --> 00:50:03,640 But that said, if you're accessing [UNINTELLIGIBLE], 1073 00:50:03,640 --> 00:50:05,790 that's this thing called prefetcher that say you are 1074 00:50:05,790 --> 00:50:08,070 going through linear pattern, and you might need that next, 1075 00:50:08,070 --> 00:50:10,120 and it'll bring this up for you. 1076 00:50:10,120 --> 00:50:15,470 And what you find, I'll show later, that in fact, the 1077 00:50:15,470 --> 00:50:19,110 machine you're using has a really powerful prefetcher, so 1078 00:50:19,110 --> 00:50:20,270 it can hide a lot of these things. 1079 00:50:20,270 --> 00:50:21,520 So we'll get to that. 1080 00:50:21,520 --> 00:50:25,420 1081 00:50:25,420 --> 00:50:27,940 So this is my total access time in here. 1082 00:50:27,940 --> 00:50:28,820 Random case. 1083 00:50:28,820 --> 00:50:31,670 Random case is basically, every time I access something, 1084 00:50:31,670 --> 00:50:32,920 it's not in the cache. 1085 00:50:32,920 --> 00:50:38,340 1086 00:50:38,340 --> 00:50:42,720 And then if you have a bigger struct in here-- 1087 00:50:42,720 --> 00:50:45,690 but assume I have this big struct in here, but I am only 1088 00:50:45,690 --> 00:50:50,830 accessing one element of this struct in here. 1089 00:50:50,830 --> 00:50:51,400 I do a lot. 1090 00:50:51,400 --> 00:50:55,910 I created this large struct, and I am [UNINTELLIGIBLE] 1091 00:50:55,910 --> 00:50:58,120 things in the struct, and I just basically 1092 00:50:58,120 --> 00:51:00,750 looking at my data. 1093 00:51:00,750 --> 00:51:02,310 Only one element. 1094 00:51:02,310 --> 00:51:03,560 What's the problem here? 1095 00:51:03,560 --> 00:51:18,420 1096 00:51:18,420 --> 00:51:19,670 AUDIENCE: [INAUDIBLE] 1097 00:51:19,670 --> 00:51:22,340 1098 00:51:22,340 --> 00:51:25,180 PROFESSOR: Exactly, so I bring the entire cache line, but I'm 1099 00:51:25,180 --> 00:51:27,220 only using a little bit of it. 1100 00:51:27,220 --> 00:51:30,120 Because I bring all these things into the cache, it 1101 00:51:30,120 --> 00:51:33,260 comes with axis and that, but I'm only using data. 1102 00:51:33,260 --> 00:51:36,720 So basically, what happens is, if everything in the cache, I 1103 00:51:36,720 --> 00:51:38,440 have access time S. That's great. 1104 00:51:38,440 --> 00:51:40,990 That's not a problem, [UNINTELLIGIBLE] 1105 00:51:40,990 --> 00:51:42,240 streaming. 1106 00:51:42,240 --> 00:51:44,990 1107 00:51:44,990 --> 00:51:47,920 Even though I'm getting spatial locality, I only get 1108 00:51:47,920 --> 00:51:50,810 half of it because even though I'm bringing the cache line, 1109 00:51:50,810 --> 00:51:54,630 only two have datas because this is 32 bytes long. 1110 00:51:54,630 --> 00:51:57,150 It's in my cache. 1111 00:51:57,150 --> 00:52:01,910 Even though I am bringing all this data, I only two of them 1112 00:52:01,910 --> 00:52:04,600 in the cache because rest of this other data is kind of 1113 00:52:04,600 --> 00:52:07,370 hanging in there with me, and so I don't get that. 1114 00:52:07,370 --> 00:52:10,610 And random, of course, you have no locality. 1115 00:52:10,610 --> 00:52:13,320 So how can I basically make this better? 1116 00:52:13,320 --> 00:52:16,010 1117 00:52:16,010 --> 00:52:17,010 How can I make this better? 1118 00:52:17,010 --> 00:52:19,520 OK, I just showed it very fast. 1119 00:52:19,520 --> 00:52:20,263 OK. 1120 00:52:20,263 --> 00:52:21,513 AUDIENCE: [INAUDIBLE] 1121 00:52:21,513 --> 00:52:23,510 1122 00:52:23,510 --> 00:52:26,350 PROFESSOR: So instead, if I ask you, I have this large 1123 00:52:26,350 --> 00:52:29,560 data structure, but most of time, I'm only accessing a 1124 00:52:29,560 --> 00:52:36,040 small part of it, this is kind of a little bit of a hack. 1125 00:52:36,040 --> 00:52:40,370 I mean, this is kind of going against, I guess 6005 kind of 1126 00:52:40,370 --> 00:52:42,120 building, but this actually works. 1127 00:52:42,120 --> 00:52:45,700 What you can do is, you can go back and instead of doing 1128 00:52:45,700 --> 00:52:48,010 structs, what you can do is you can do arrays 1129 00:52:48,010 --> 00:52:51,040 of each data item. 1130 00:52:51,040 --> 00:52:53,750 And now what happens is that I am basically bringing these, 1131 00:52:53,750 --> 00:52:56,050 and then the data I am accessing is basically right 1132 00:52:56,050 --> 00:52:57,300 now next to each other. 1133 00:52:57,300 --> 00:52:59,700 1134 00:52:59,700 --> 00:53:02,660 So you might have this very large data structure 1135 00:53:02,660 --> 00:53:04,880 representing something you have a lot of, then. 1136 00:53:04,880 --> 00:53:07,720 But most of the time, you're only using few of them, so you 1137 00:53:07,720 --> 00:53:09,050 don't have to be structs of arrays. 1138 00:53:09,050 --> 00:53:11,930 You might have couple of different data structures. 1139 00:53:11,930 --> 00:53:15,370 One is the one that you access regularly hopefully you get 1140 00:53:15,370 --> 00:53:16,460 them next to each other. 1141 00:53:16,460 --> 00:53:18,635 And the nice thing about arrays is it guarantees things 1142 00:53:18,635 --> 00:53:19,220 are next to each other. 1143 00:53:19,220 --> 00:53:21,144 Of course, was managing it is harder. 1144 00:53:21,144 --> 00:53:22,394 AUDIENCE: [UNINTELLIGIBLE] 1145 00:53:22,394 --> 00:53:26,358 1146 00:53:26,358 --> 00:53:27,275 PROFESSOR: You can do both. 1147 00:53:27,275 --> 00:53:30,260 So that means instead of doing the structs, you use arrays of 1148 00:53:30,260 --> 00:53:32,440 each field. 1149 00:53:32,440 --> 00:53:33,405 AUDIENCE: [INAUDIBLE] 1150 00:53:33,405 --> 00:53:34,495 PROFESSOR: Instead of. 1151 00:53:34,495 --> 00:53:37,249 So instead of doing this one, you just do this one. 1152 00:53:37,249 --> 00:53:38,247 AUDIENCE: [INAUDIBLE] 1153 00:53:38,247 --> 00:53:41,740 instead of being able to pass stuff over. 1154 00:53:41,740 --> 00:53:45,013 Now you have to kind of be able to know [UNINTELLIGIBLE], 1155 00:53:45,013 --> 00:53:48,288 so what that is where because you cannot just pass the 1156 00:53:48,288 --> 00:53:49,370 struct [UNINTELLIGIBLE]. 1157 00:53:49,370 --> 00:53:50,410 PROFESSOR: No, you can't pass the struct. 1158 00:53:50,410 --> 00:53:52,500 You basically have to pad the index in here for this 1159 00:53:52,500 --> 00:53:53,500 structure, basically. 1160 00:53:53,500 --> 00:53:54,660 So you have to know something in there. 1161 00:53:54,660 --> 00:53:58,870 So this is a lot messier, but it can actually give you a 1162 00:53:58,870 --> 00:54:00,990 performance if you do this type of access, so it's not 1163 00:54:00,990 --> 00:54:01,920 something you want everywhere. 1164 00:54:01,920 --> 00:54:06,130 That's why you still want to use structures and nice 1165 00:54:06,130 --> 00:54:06,830 classes and stuff. 1166 00:54:06,830 --> 00:54:11,580 But this, if you're doing very large data, but most of the 1167 00:54:11,580 --> 00:54:13,910 you're accessing only small amount, this kind of 1168 00:54:13,910 --> 00:54:17,750 representation can help. 1169 00:54:17,750 --> 00:54:20,850 So I want to get a little bit into matrix multiplying. 1170 00:54:20,850 --> 00:54:26,420 So what we're looking at is this four by four matrix, and 1171 00:54:26,420 --> 00:54:29,730 you can represent it using this representation. 1172 00:54:29,730 --> 00:54:33,150 But to make it clear, what I'm going to do is I'm just going 1173 00:54:33,150 --> 00:54:35,650 to allocate this entire thing as a single array, and I'm 1174 00:54:35,650 --> 00:54:37,100 going to do that [UNINTELLIGIBLE] calculation 1175 00:54:37,100 --> 00:54:43,080 myself because if you go a little bit down, you realize 1176 00:54:43,080 --> 00:54:46,910 this is allocated in a little bit of a weird way. 1177 00:54:46,910 --> 00:54:52,040 And so this is what we call row-major because all rows are 1178 00:54:52,040 --> 00:54:54,670 next to each other, and you can access allocated 1179 00:54:54,670 --> 00:54:55,870 column-major. 1180 00:54:55,870 --> 00:55:00,790 Basically that means now A001, instead of that, 1, 0 is next 1181 00:55:00,790 --> 00:55:04,180 to 0, 0, 1, 0, 2, 0. 1182 00:55:04,180 --> 00:55:06,860 And then basically saying same thing, [UNINTELLIGIBLE] 1183 00:55:06,860 --> 00:55:10,230 instead of saying 4i plus j, it's just 4j plus i. 1184 00:55:10,230 --> 00:55:12,340 You can do both. 1185 00:55:12,340 --> 00:55:18,460 So this is my nice matrix multiply function, because the 1186 00:55:18,460 --> 00:55:21,240 reason I did it this way, it's just very clear how many axis 1187 00:55:21,240 --> 00:55:23,810 I'm going to use. 1188 00:55:23,810 --> 00:55:28,920 So if you look at C, let's look at only the inner loop. 1189 00:55:28,920 --> 00:55:30,780 Most of the mapping matters in the inner loop. 1190 00:55:30,780 --> 00:55:35,820 I go for k in 0 to size. 1191 00:55:35,820 --> 00:55:40,830 Only first axis on C is going to miss because it's going up 1192 00:55:40,830 --> 00:55:41,060 [UNINTELLIGIBLE] 1193 00:55:41,060 --> 00:55:44,050 same K, da-da-da-da-da, so C is very nice in here. 1194 00:55:44,050 --> 00:55:45,330 Only get one miss in here. 1195 00:55:45,330 --> 00:55:46,273 You have a question? 1196 00:55:46,273 --> 00:55:47,120 OK. 1197 00:55:47,120 --> 00:55:51,680 A has a streaming pattern because every time I update K, 1198 00:55:51,680 --> 00:55:53,360 I go to the next element, next element, next 1199 00:55:53,360 --> 00:55:55,360 element, next element. 1200 00:55:55,360 --> 00:55:57,090 [INAUDIBLE]. 1201 00:55:57,090 --> 00:55:59,828 What type of a pattern do I have in B? 1202 00:55:59,828 --> 00:56:01,510 AUDIENCE: Looks like [INAUDIBLE]. 1203 00:56:01,510 --> 00:56:01,970 PROFESSOR: Yeah, exactly. 1204 00:56:01,970 --> 00:56:03,270 I have jumping on memory. 1205 00:56:03,270 --> 00:56:04,510 I'm jumping one to another. 1206 00:56:04,510 --> 00:56:07,590 So if you look at that, I am going through A like that, and 1207 00:56:07,590 --> 00:56:11,730 B, and going through C element at a time, basically. 1208 00:56:11,730 --> 00:56:15,360 So if you look at doing this in order to calculate one 1209 00:56:15,360 --> 00:56:18,710 element of A, [INAUDIBLE] calculating these, what you 1210 00:56:18,710 --> 00:56:21,850 have to do is I am getting B and C this way. 1211 00:56:21,850 --> 00:56:24,820 I'm going through row here and column, and this row 1212 00:56:24,820 --> 00:56:25,500 [UNINTELLIGIBLE] 1213 00:56:25,500 --> 00:56:27,510 fit into the cache line through that because this is 1214 00:56:27,510 --> 00:56:28,230 the memory. 1215 00:56:28,230 --> 00:56:31,970 But C is just jumping all over the memory, and I have no 1216 00:56:31,970 --> 00:56:34,200 locality in here. 1217 00:56:34,200 --> 00:56:36,490 So what can we do here? 1218 00:56:36,490 --> 00:56:38,290 What's the simplest thing in here? 1219 00:56:38,290 --> 00:56:40,069 How do we get some locality into the C? 1220 00:56:40,069 --> 00:56:46,166 1221 00:56:46,166 --> 00:56:48,360 AUDIENCE: [INAUDIBLE] 1222 00:56:48,360 --> 00:56:51,050 PROFESSOR: So yeah, you're coming, you're getting it. 1223 00:56:51,050 --> 00:56:53,170 That's the next thing that's most important. 1224 00:56:53,170 --> 00:56:54,460 That's the more advanced thing you can do. 1225 00:56:54,460 --> 00:56:55,280 We'll get there. 1226 00:56:55,280 --> 00:56:58,298 What was a simple thing you can do for C? 1227 00:56:58,298 --> 00:56:59,130 AUDIENCE: [INAUDIBLE] 1228 00:56:59,130 --> 00:57:00,380 PROFESSOR: [UNINTELLIGIBLE] 1229 00:57:00,380 --> 00:57:02,540 1230 00:57:02,540 --> 00:57:03,630 So here's the [UNINTELLIGIBLE] 1231 00:57:03,630 --> 00:57:07,440 execution time, original, and voila, when you transpose 1232 00:57:07,440 --> 00:57:10,830 that, you can actually get a huge benefit in here because, 1233 00:57:10,830 --> 00:57:14,950 now, I get spatial locality in both sides in here. 1234 00:57:14,950 --> 00:57:16,570 And so now when you're transposed, I am going 1235 00:57:16,570 --> 00:57:18,630 through like this. 1236 00:57:18,630 --> 00:57:22,500 So now what I did was I am giving this CPI, and what 1237 00:57:22,500 --> 00:57:24,680 happens is when you transpose, I have a 5x 1238 00:57:24,680 --> 00:57:26,020 improvement in sync CPI. 1239 00:57:26,020 --> 00:57:28,570 That means instructions actually not waiting, not 1240 00:57:28,570 --> 00:57:29,460 doing nothing. 1241 00:57:29,460 --> 00:57:34,900 I have a pretty good reduce of cache misread. 1242 00:57:34,900 --> 00:57:36,340 because, now, I'm actually going 1243 00:57:36,340 --> 00:57:38,640 through cache very nicely. 1244 00:57:38,640 --> 00:57:41,090 So the next thing we can do is, what she just 1245 00:57:41,090 --> 00:57:42,420 pointed out, blocking. 1246 00:57:42,420 --> 00:57:45,560 So instead of matrix multiply, just multiplying this matrix 1247 00:57:45,560 --> 00:57:48,920 at a time, you can multiply submatrices and calculate 1248 00:57:48,920 --> 00:57:50,260 small submatrices. 1249 00:57:50,260 --> 00:57:52,950 And if you do that, the submatrix can nicely fit in 1250 00:57:52,950 --> 00:57:54,440 the memory, so let me show you that. 1251 00:57:54,440 --> 00:57:56,510 This is the blocking code. 1252 00:57:56,510 --> 00:57:58,780 Let me show you what that means. 1253 00:57:58,780 --> 00:58:02,580 So what we are doing is you are multiplying this matrix by 1254 00:58:02,580 --> 00:58:06,030 this matrix in there and getting this matrix. 1255 00:58:06,030 --> 00:58:09,240 So what that means is, normally, what you do is, to 1256 00:58:09,240 --> 00:58:13,490 get one element, you have to multiply a row and a column. 1257 00:58:13,490 --> 00:58:16,200 Normally, we are calculating a row of A at a time. 1258 00:58:16,200 --> 00:58:20,420 In order to get a row of A, I had to get a row of B 1259 00:58:20,420 --> 00:58:20,860 [UNINTELLIGIBLE] 1260 00:58:20,860 --> 00:58:24,520 entire C has to participate into calculate A. So by the 1261 00:58:24,520 --> 00:58:28,440 time I have calculated 1024 elements, I have gone through 1262 00:58:28,440 --> 00:58:32,230 this many axises to new elements basically because I 1263 00:58:32,230 --> 00:58:34,970 am getting B again and again, but I had to go through the 1264 00:58:34,970 --> 00:58:36,310 entire C in here. 1265 00:58:36,310 --> 00:58:40,550 So if this much data is bigger than the cache, I have no 1266 00:58:40,550 --> 00:58:42,830 locality in here. 1267 00:58:42,830 --> 00:58:47,590 But if, instead of calculating a row, if I calculate a 32 by 1268 00:58:47,590 --> 00:58:52,570 32 block, which is again 1024 elements I am calculating. 1269 00:58:52,570 --> 00:58:56,050 If I do that, I need a block here and a block here, and I 1270 00:58:56,050 --> 00:58:57,760 only type this much data. 1271 00:58:57,760 --> 00:59:01,200 1272 00:59:01,200 --> 00:59:05,610 OK, I'm getting the same amount of output, but because 1273 00:59:05,610 --> 00:59:11,150 of reuse, I touch a lot less inputting here. 1274 00:59:11,150 --> 00:59:13,490 And just by doing this transformation, I can really 1275 00:59:13,490 --> 00:59:16,540 reduce the amount of data. 1276 00:59:16,540 --> 00:59:19,330 Basically, I can create a smaller working set, go 1277 00:59:19,330 --> 00:59:22,240 through that much more, and reduce the amount of cache 1278 00:59:22,240 --> 00:59:23,150 misses you get. 1279 00:59:23,150 --> 00:59:26,750 So by doing that, I basically go out another 3x improvement 1280 00:59:26,750 --> 00:59:30,140 in my CPI, really brought down my L1 cache misses, and 1281 00:59:30,140 --> 00:59:34,070 basically eliminated L2 cache misses. 1282 00:59:34,070 --> 00:59:38,280 OK, so I got a really good speed up in here. 1283 00:59:38,280 --> 00:59:41,370 So you can block multiple levels, L1, L2, L3, because 1284 00:59:41,370 --> 00:59:43,750 every time you figure out the workings and size, you block 1285 00:59:43,750 --> 00:59:48,550 so the data in that will be safe in the cache in here. 1286 00:59:48,550 --> 00:59:51,160 And this is kind of nasty to do by hand. 1287 00:59:51,160 --> 00:59:52,410 We'll see. 1288 00:59:52,410 --> 00:59:55,400 1289 00:59:55,400 --> 00:59:57,840 So the other interesting thing is call stack locality. 1290 00:59:57,840 --> 01:00:02,800 So if you do something like fib in here, traditional fib, 1291 01:00:02,800 --> 01:00:06,100 if you do fib(4), you calculate a fib(4), fib(3), 1292 01:00:06,100 --> 01:00:08,330 fib(2), you kind of build a stack, go up 1293 01:00:08,330 --> 01:00:10,830 back again in here. 1294 01:00:10,830 --> 01:00:14,020 So if you keep doing that, the nice thing is you're 1295 01:00:14,020 --> 01:00:17,620 going up, but its-- 1296 01:00:17,620 --> 01:00:20,050 Hopefully, one of these days, I am going to hit the wrong 1297 01:00:20,050 --> 01:00:21,300 button, and-- 1298 01:00:21,300 --> 01:00:28,520 1299 01:00:28,520 --> 01:00:31,110 What kind of locality do we have if we are accessing data 1300 01:00:31,110 --> 01:00:32,930 like this in my call stack? 1301 01:00:32,930 --> 01:00:38,090 1302 01:00:38,090 --> 01:00:39,650 What kind of locality do we have? 1303 01:00:39,650 --> 01:00:43,650 1304 01:00:43,650 --> 01:00:47,170 I mean, I access here, fib(4), and I get back to fib(4) here. 1305 01:00:47,170 --> 01:00:51,320 1306 01:00:51,320 --> 01:00:54,890 So within each frame, I am accessing data next to each 1307 01:00:54,890 --> 01:01:01,810 other, so I can have a little bit of a spatial locality 1308 01:01:01,810 --> 01:01:04,640 because, each frame, all the data is next to each other. 1309 01:01:04,640 --> 01:01:07,780 But across frames, normally, I will have some temporal 1310 01:01:07,780 --> 01:01:11,110 because I will get back to that data some time. 1311 01:01:11,110 --> 01:01:12,360 OK? 1312 01:01:12,360 --> 01:01:18,790 1313 01:01:18,790 --> 01:01:21,460 OK, before I summarize, this one. 1314 01:01:21,460 --> 01:01:26,240 If you are reading a lot of data, processing through stage 1315 01:01:26,240 --> 01:01:27,590 one, [UNINTELLIGIBLE] 1316 01:01:27,590 --> 01:01:29,680 and producing [UNINTELLIGIBLE], that means 1317 01:01:29,680 --> 01:01:32,000 you're reading a huge amount of data through stage one, 1318 01:01:32,000 --> 01:01:37,600 save, save data again, do stage two, you have really bad 1319 01:01:37,600 --> 01:01:38,780 cache locality. 1320 01:01:38,780 --> 01:01:41,950 One thing you can do is read a small chunk, do all the stages 1321 01:01:41,950 --> 01:01:44,970 on that small chunk, save it, and go to another chunk. 1322 01:01:44,970 --> 01:01:46,890 So instead of doing the entire data, doing something 1323 01:01:46,890 --> 01:01:47,370 [UNINTELLIGIBLE] 1324 01:01:47,370 --> 01:01:50,550 producing, you can do this chunk-ifying, and by doing 1325 01:01:50,550 --> 01:01:52,220 that, you can get a lot of locality. 1326 01:01:52,220 --> 01:01:56,100 So once you bring in data, you like to do as many things as 1327 01:01:56,100 --> 01:01:59,620 possible to that data before it gets out of the cache. 1328 01:01:59,620 --> 01:02:01,260 That's the entire thing about locality, and you can 1329 01:02:01,260 --> 01:02:02,810 restructure code through that. 1330 01:02:02,810 --> 01:02:07,680 So if you look at that, what we have done today, we have 1331 01:02:07,680 --> 01:02:10,510 looked at things like associativity, cache lines, 1332 01:02:10,510 --> 01:02:16,035 cache sets in here, and how to address and map to a cache, we 1333 01:02:16,035 --> 01:02:18,780 kind of did this thing, and we look at [UNINTELLIGIBLE] 1334 01:02:18,780 --> 01:02:21,550 streamings versus InCache versus no 1335 01:02:21,550 --> 01:02:24,250 locality type patterns. 1336 01:02:24,250 --> 01:02:27,540 And then when you look at data structure transformation, we 1337 01:02:27,540 --> 01:02:31,390 look at how to basically replace structured arrays and 1338 01:02:31,390 --> 01:02:35,130 split data structures to make sure that you get a lot of 1339 01:02:35,130 --> 01:02:40,230 nice locality if you access a certain amount of data, and 1340 01:02:40,230 --> 01:02:43,200 pack data to get smaller cache footprints kind of things. 1341 01:02:43,200 --> 01:02:45,760 And we also look at computer transformations, things like 1342 01:02:45,760 --> 01:02:48,840 transpose copying, compute copy, outtype things you can 1343 01:02:48,840 --> 01:02:53,040 do, so you can actually get nice locality, so you can do 1344 01:02:53,040 --> 01:02:55,510 both data and compute transformations. 1345 01:02:55,510 --> 01:02:57,300 And [UNINTELLIGIBLE] 1346 01:02:57,300 --> 01:03:01,360 axises and [? re-cast ?] 1347 01:03:01,360 --> 01:03:04,570 calculation stages, so this is kind of summary. 1348 01:03:04,570 --> 01:03:07,560 Next thing I want go is look at what machines 1349 01:03:07,560 --> 01:03:08,790 you guys are using. 1350 01:03:08,790 --> 01:03:11,810 So this is the machines we used last year. 1351 01:03:11,810 --> 01:03:15,390 These are quad cores, and it had L1 cache, so these 1352 01:03:15,390 --> 01:03:16,130 [UNINTELLIGIBLE]. 1353 01:03:16,130 --> 01:03:20,800 So this has 32 kilobyte L1 cache, data cache 1354 01:03:20,800 --> 01:03:21,520 [UNINTELLIGIBLE] 1355 01:03:21,520 --> 01:03:24,740 two caches in here for data and instructions, each core, 1356 01:03:24,740 --> 01:03:27,890 and then the cache line size is normally 64 bytes 1357 01:03:27,890 --> 01:03:28,930 [UNINTELLIGIBLE]. 1358 01:03:28,930 --> 01:03:31,690 And here, latency [UNINTELLIGIBLE], you get 1359 01:03:31,690 --> 01:03:33,780 three cycles. 1360 01:03:33,780 --> 01:03:41,420 If not, you get 14 cycles to go to to L2 in here if data 1361 01:03:41,420 --> 01:03:42,360 doesn't meet there. 1362 01:03:42,360 --> 01:03:43,700 And associated with this [UNINTELLIGIBLE PHRASE]. 1363 01:03:43,700 --> 01:03:46,330 1364 01:03:46,330 --> 01:03:48,690 So that was what we did last year. 1365 01:03:48,690 --> 01:03:52,390 This year, we have a bigger machine that has multiple 1366 01:03:52,390 --> 01:03:53,140 levels of caches. 1367 01:03:53,140 --> 01:03:57,490 So each core has L1 instruction and data cache. 1368 01:03:57,490 --> 01:04:00,160 It's kind of the same, 32 kilobyte cache, four 1369 01:04:00,160 --> 01:04:02,090 nanosecond latency in here. 1370 01:04:02,090 --> 01:04:03,930 Eight, three and four [UNINTELLIGIBLE], they kind of 1371 01:04:03,930 --> 01:04:10,370 reduced that, but they added a 256-kilobyte L2 cache in here, 1372 01:04:10,370 --> 01:04:13,120 and then a pretty large humongo 1373 01:04:13,120 --> 01:04:16,280 12-megabyte L3 cache in here. 1374 01:04:16,280 --> 01:04:19,270 The interesting thing is here, this pretty fast access to 1375 01:04:19,270 --> 01:04:19,910 these ones. 1376 01:04:19,910 --> 01:04:27,410 If you miss in here, this is about 2 and 1/2 times cost 1377 01:04:27,410 --> 01:04:29,190 from going from here to here. 1378 01:04:29,190 --> 01:04:33,780 From here to [UNINTELLIGIBLE], about a five times cost, so 1379 01:04:33,780 --> 01:04:35,710 here to here is a five times cost. 1380 01:04:35,710 --> 01:04:37,835 The interesting thing that actually surprises a lot us 1381 01:04:37,835 --> 01:04:40,410 was here to here, it's cost is not that different. 1382 01:04:40,410 --> 01:04:42,420 [UNINTELLIGIBLE PHRASE] 1383 01:04:42,420 --> 01:04:44,435 So actually memory, even though it's sitting out in 1384 01:04:44,435 --> 01:04:48,520 there, they have managed to get it pretty cheap access. 1385 01:04:48,520 --> 01:04:50,970 Normally, my intuition say this should be much higher 1386 01:04:50,970 --> 01:04:54,200 because you're going off-chip, so this is the structure here. 1387 01:04:54,200 --> 01:04:56,470 It's a lot more complicated. 1388 01:04:56,470 --> 01:04:58,930 So here's something I produced last year. 1389 01:04:58,930 --> 01:05:02,830 So what I did was I ran this program. 1390 01:05:02,830 --> 01:05:06,210 And in these machines, there's a way to turn off the prefetch 1391 01:05:06,210 --> 01:05:08,060 of the previous [UNINTELLIGIBLE]. 1392 01:05:08,060 --> 01:05:09,940 So when you've turned on the prefetch and then you run, 1393 01:05:09,940 --> 01:05:13,230 what you are doing is I am accessing data with a working 1394 01:05:13,230 --> 01:05:15,520 set, small working set, little bit bigger working set. 1395 01:05:15,520 --> 01:05:19,360 So this is basically when you consider working set size. 1396 01:05:19,360 --> 01:05:21,890 So you get a graph like this. 1397 01:05:21,890 --> 01:05:28,120 I am hitting my L1 cache, and if it is more than that, I 1398 01:05:28,120 --> 01:05:29,170 [UNINTELLIGIBLE] 1399 01:05:29,170 --> 01:05:30,520 L1 cache, I go to my L2 cache. 1400 01:05:30,520 --> 01:05:34,440 1401 01:05:34,440 --> 01:05:37,900 And if L2 misses, I have to go to main memory. 1402 01:05:37,900 --> 01:05:40,320 Very traditional kind of a graph in here. 1403 01:05:40,320 --> 01:05:43,450 1404 01:05:43,450 --> 01:05:46,160 Another a little bit of an interesting thing, which is 1405 01:05:46,160 --> 01:05:47,850 instead of accessing next item, I kind 1406 01:05:47,850 --> 01:05:48,730 of skipped in here. 1407 01:05:48,730 --> 01:05:57,590 So this trying to force how to get conflict misses. 1408 01:05:57,590 --> 01:06:02,110 And so if you line up things like this exactly, what 1409 01:06:02,110 --> 01:06:04,470 happens is you get these conflict misses, and you get 1410 01:06:04,470 --> 01:06:06,400 this number. 1411 01:06:06,400 --> 01:06:07,760 So I have interesting story to tell. 1412 01:06:07,760 --> 01:06:09,860 When I was a graduate student, we were running this bunch of 1413 01:06:09,860 --> 01:06:14,000 benchmarks, and normally when you run a benchmark, you try 1414 01:06:14,000 --> 01:06:15,240 it for an equal 1415 01:06:15,240 --> 01:06:18,820 [UNINTELLIGIBLE], stuff like that. 1416 01:06:18,820 --> 01:06:21,310 And we got this graph like this. 1417 01:06:21,310 --> 01:06:26,410 I mean, we expect a graph going like this. 1418 01:06:26,410 --> 01:06:28,450 OK, that how a benchmark [UNINTELLIGIBLE]. 1419 01:06:28,450 --> 01:06:32,080 Then at one point, by accident, instead of typing an 1420 01:06:32,080 --> 01:06:37,120 equal 64-something, I typed an equal 63, and suddenly, I got 1421 01:06:37,120 --> 01:06:39,130 a performance like this. 1422 01:06:39,130 --> 01:06:41,250 Wait a minute, what's going on? 1423 01:06:41,250 --> 01:06:43,420 So in here, what I got was not a line. 1424 01:06:43,420 --> 01:06:49,020 I got 16, 32, 64, 128, nice graph like that. 1425 01:06:49,020 --> 01:06:51,540 And then I just did 63, like down to here. 1426 01:06:51,540 --> 01:06:53,390 I'm like, geeze, that can't be right. 1427 01:06:53,390 --> 01:06:55,500 And then I start looking at 62, which is here. 1428 01:06:55,500 --> 01:06:56,770 61 is here. 1429 01:06:56,770 --> 01:07:00,720 So I realize if I enumerate all those things, I got a 1430 01:07:00,720 --> 01:07:01,970 graph like this, basically. 1431 01:07:01,970 --> 01:07:05,380 1432 01:07:05,380 --> 01:07:07,600 Kind of something like this. 1433 01:07:07,600 --> 01:07:11,820 And the thing is, if I had never tried the 63, I would 1434 01:07:11,820 --> 01:07:13,780 have basically thought my performance was like 1435 01:07:13,780 --> 01:07:16,310 this, so far worse. 1436 01:07:16,310 --> 01:07:20,210 Never realized because it was all having conflicts in memory 1437 01:07:20,210 --> 01:07:21,220 because I'm [UNINTELLIGIBLE] 1438 01:07:21,220 --> 01:07:23,120 2 to the power numbers. 1439 01:07:23,120 --> 01:07:25,300 So this is what happens if you just look at 2 to the power 1440 01:07:25,300 --> 01:07:27,280 numbers, you'll think, yeah, this is the performance you're 1441 01:07:27,280 --> 01:07:29,160 getting, without realizing that, in fact, you 1442 01:07:29,160 --> 01:07:31,290 can get this far. 1443 01:07:31,290 --> 01:07:32,140 So this is good. 1444 01:07:32,140 --> 01:07:34,200 So I ran it on the new machine. 1445 01:07:34,200 --> 01:07:36,650 The problem with our new machine is there's no way to 1446 01:07:36,650 --> 01:07:39,510 turn off prefetching. 1447 01:07:39,510 --> 01:07:41,520 OK, so this is what you're getting. 1448 01:07:41,520 --> 01:07:44,320 Everything, because I am accessing this nice linear, 1449 01:07:44,320 --> 01:07:45,640 looks same. 1450 01:07:45,640 --> 01:07:47,550 This is exactly what you want to happen. 1451 01:07:47,550 --> 01:07:50,760 What it's doing is, minute you go to access, it realizes you 1452 01:07:50,760 --> 01:07:51,800 accessing linearly. 1453 01:07:51,800 --> 01:07:54,190 It start fetching, fetching, fetching data in front you, 1454 01:07:54,190 --> 01:07:55,310 it's going to catch up. 1455 01:07:55,310 --> 01:07:57,840 It's not feeding you data, so you're not going to have any 1456 01:07:57,840 --> 01:08:00,760 cache missed because you're going through this nice, 1457 01:08:00,760 --> 01:08:04,060 linear pattern, and you just basic end up with this 1458 01:08:04,060 --> 01:08:08,250 beautiful graph, which gives the feeling your working set 1459 01:08:08,250 --> 01:08:09,870 could be, now, gigabytes long. 1460 01:08:09,870 --> 01:08:14,130 1461 01:08:14,130 --> 01:08:19,620 At this point, I am fitting into the entire L3 cache, and 1462 01:08:19,620 --> 01:08:22,620 still look like this works beautiful because this is what 1463 01:08:22,620 --> 01:08:24,210 you want for it to figure out. 1464 01:08:24,210 --> 01:08:25,740 So now I'm like, geeze, this is great. 1465 01:08:25,740 --> 01:08:28,569 This means my cache works, it's very nice, but I want to 1466 01:08:28,569 --> 01:08:30,850 show you guys what's going on, so that doesn't help. 1467 01:08:30,850 --> 01:08:33,479 So I came up with this program. 1468 01:08:33,479 --> 01:08:37,240 So this program, what I did was I created a working set, 1469 01:08:37,240 --> 01:08:39,689 but I'm accessing randomly within that working set. 1470 01:08:39,689 --> 01:08:40,880 I calculated this address. 1471 01:08:40,880 --> 01:08:44,700 This is my poor man's random number generator, so basically 1472 01:08:44,700 --> 01:08:55,630 I'm adding number in here and multiplying this, and then the 1473 01:08:55,630 --> 01:08:59,290 masking these to get the right [UNINTELLIGIBLE] 1474 01:08:59,290 --> 01:09:01,850 so I get within that set in here. 1475 01:09:01,850 --> 01:09:04,040 And then I just access data, so I'm getting 1476 01:09:04,040 --> 01:09:04,490 [UNINTELLIGIBLE]. 1477 01:09:04,490 --> 01:09:06,770 So I get a number like this, and this jump up here, and 1478 01:09:06,770 --> 01:09:09,620 then I had this kind of behavior. 1479 01:09:09,620 --> 01:09:10,750 This is a lot more complicated. 1480 01:09:10,750 --> 01:09:12,779 I need to figure out what the hell is going on in here, so I 1481 01:09:12,779 --> 01:09:15,729 said, OK, let's now look at performance counters. 1482 01:09:15,729 --> 01:09:17,340 I look at L1 cache miss. 1483 01:09:17,340 --> 01:09:20,620 OK, up to 32K, I have no cache misses. 1484 01:09:20,620 --> 01:09:22,694 Boom, cache misses jump up. 1485 01:09:22,694 --> 01:09:23,189 Perfect. 1486 01:09:23,189 --> 01:09:26,380 And I have this little bit of a kink in here, this beautiful 1487 01:09:26,380 --> 01:09:30,670 describes because I have 32K L1 cache, and until then, I 1488 01:09:30,670 --> 01:09:32,990 have no things, and then when the cache start missing, I'm 1489 01:09:32,990 --> 01:09:34,149 jumping up. 1490 01:09:34,149 --> 01:09:34,750 Perfect. 1491 01:09:34,750 --> 01:09:36,550 I got that description here. 1492 01:09:36,550 --> 01:09:38,779 Then I look at my L2 cache misses. 1493 01:09:38,779 --> 01:09:41,915 This is, I think, the L2 cache in this machine might have a 1494 01:09:41,915 --> 01:09:43,340 little bit more complex [UNINTELLIGIBLE] 1495 01:09:43,340 --> 01:09:45,630 telling us because this is not going like 1496 01:09:45,630 --> 01:09:47,450 here had jumping up. 1497 01:09:47,450 --> 01:09:49,760 That's a 256K [UNINTELLIGIBLE] 1498 01:09:49,760 --> 01:09:50,240 get. 1499 01:09:50,240 --> 01:09:50,880 Is this 256? 1500 01:09:50,880 --> 01:09:51,720 Yeah. 1501 01:09:51,720 --> 01:09:52,580 [INAUDIBLE] 1502 01:09:52,580 --> 01:09:53,760 250 is [UNINTELLIGIBLE] 1503 01:09:53,760 --> 01:09:54,130 these things. 1504 01:09:54,130 --> 01:09:58,920 OK, so we have 32 256 2L megabytes in here. 1505 01:09:58,920 --> 01:10:03,034 Even if you have 256 in here-- 1506 01:10:03,034 --> 01:10:04,284 AUDIENCE: [INAUDIBLE] 1507 01:10:04,284 --> 01:10:06,506 1508 01:10:06,506 --> 01:10:07,002 PROFESSOR: Hm? 1509 01:10:07,002 --> 01:10:09,386 AUDIENCE: Should it be 256 or 32-- 1510 01:10:09,386 --> 01:10:10,670 PROFESSOR: [UNINTELLIGIBLE] 1511 01:10:10,670 --> 01:10:12,380 because everything's backed up, it's hierarchy 1512 01:10:12,380 --> 01:10:13,940 [UNINTELLIGIBLE], so it's not added to, it's not next to 1513 01:10:13,940 --> 01:10:16,820 each other, because at this point, L1 is now useless. 1514 01:10:16,820 --> 01:10:19,780 It'll run everything is just going to L2, so L2 is the one 1515 01:10:19,780 --> 01:10:21,205 who was serving it. 1516 01:10:21,205 --> 01:10:24,800 1517 01:10:24,800 --> 01:10:26,050 Come on. 1518 01:10:26,050 --> 01:10:29,920 1519 01:10:29,920 --> 01:10:32,640 So what happens in here, this normally should start jumping 1520 01:10:32,640 --> 01:10:36,390 here, 256, but it start jumping earlier, so there 1521 01:10:36,390 --> 01:10:37,570 might be some other interesting 1522 01:10:37,570 --> 01:10:40,090 thing going on here. 1523 01:10:40,090 --> 01:10:42,130 It might be interesting to investigate. 1524 01:10:42,130 --> 01:10:44,290 So this is OK, so still I am pretty close. 1525 01:10:44,290 --> 01:10:48,740 I realized this jump is basically because of L2 1526 01:10:48,740 --> 01:10:49,860 happen, this jump in here. 1527 01:10:49,860 --> 01:10:51,174 L2 start missing. 1528 01:10:51,174 --> 01:10:53,890 Then I look at L3. 1529 01:10:53,890 --> 01:10:56,540 It's here. 1530 01:10:56,540 --> 01:10:58,940 What's happening here? 1531 01:10:58,940 --> 01:11:03,520 So this was a big mystery the two of us were trying to debug 1532 01:11:03,520 --> 01:11:06,130 what one hour ago. 1533 01:11:06,130 --> 01:11:07,440 Not one hour. 1534 01:11:07,440 --> 01:11:08,030 Class is going. 1535 01:11:08,030 --> 01:11:08,710 Like two hours ago. 1536 01:11:08,710 --> 01:11:10,140 We was trying to figure out, OK, wait a minute, this 1537 01:11:10,140 --> 01:11:10,750 doesn't make sense. 1538 01:11:10,750 --> 01:11:12,000 My performance is here. 1539 01:11:12,000 --> 01:11:14,950 L3 should be just basically a nice flattening out here, and 1540 01:11:14,950 --> 01:11:16,020 just jumping. 1541 01:11:16,020 --> 01:11:18,480 It's not doing that, and we are like what's going on here. 1542 01:11:18,480 --> 01:11:21,940 So then we were scratching our heads, and we start looking. 1543 01:11:21,940 --> 01:11:23,630 I say, ha, there were other things going on. 1544 01:11:23,630 --> 01:11:26,670 We looked at this thing called TLB. 1545 01:11:26,670 --> 01:11:28,110 Let me tell you what TLB is. 1546 01:11:28,110 --> 01:11:29,990 What we realize is at this point, we are [? seeing ?] 1547 01:11:29,990 --> 01:11:32,200 TLB misses. 1548 01:11:32,200 --> 01:11:35,900 How many people know what a TLB is? 1549 01:11:35,900 --> 01:11:39,280 OK, good, so let me spend the last couple of minutes 1550 01:11:39,280 --> 01:11:41,360 explaining what a TLB this, and that kind of explain 1551 01:11:41,360 --> 01:11:43,390 what's going on in this graph. 1552 01:11:43,390 --> 01:11:48,080 So TLB means-- 1553 01:11:48,080 --> 01:11:54,910 Normally, data is stored in memory using real addresses. 1554 01:11:54,910 --> 01:11:56,970 That means there's a memory size in here, there's an 1555 01:11:56,970 --> 01:11:59,140 address format, there's certain [UNINTELLIGIBLE]. 1556 01:11:59,140 --> 01:12:02,400 And then the program for it use virtual memory. 1557 01:12:02,400 --> 01:12:05,210 That means programs want to feel like they have a lot more 1558 01:12:05,210 --> 01:12:07,540 memory than actual, physical memory available. 1559 01:12:07,540 --> 01:12:10,330 So what happens is you have virtual memory address, but 1560 01:12:10,330 --> 01:12:13,260 somebody has to map virtual memory to physical memory. 1561 01:12:13,260 --> 01:12:16,260 So the way it work is there's a thing called translation 1562 01:12:16,260 --> 01:12:20,690 lookaside buffer, TLB, that says if you give this virtual 1563 01:12:20,690 --> 01:12:22,600 address, here's the right physical address. 1564 01:12:22,600 --> 01:12:25,590 So of course you can't do it for each address by address. 1565 01:12:25,590 --> 01:12:30,880 What you have is you have 4K pages, so the address space is 1566 01:12:30,880 --> 01:12:32,625 broken down into 4K pages. 1567 01:12:32,625 --> 01:12:35,410 1568 01:12:35,410 --> 01:12:42,140 Each page will have an entry in this TLB, saying if you get 1569 01:12:42,140 --> 01:12:46,270 this virtual address, here's the right physical address. 1570 01:12:46,270 --> 01:12:48,530 So normally what happens is when the operating system they 1571 01:12:48,530 --> 01:12:53,030 created this large table, for every place in memory, you use 1572 01:12:53,030 --> 01:12:56,040 this table, and it loads into memory. 1573 01:12:56,040 --> 01:12:59,220 Then the hardware will fetch and cache some of those 1574 01:12:59,220 --> 01:13:03,820 entries in this TLB cache in there. 1575 01:13:03,820 --> 01:13:09,680 So normally what happens is in here, it caches 52L entries. 1576 01:13:09,680 --> 01:13:12,620 It can keep 52L entries to 52L pages, 1577 01:13:12,620 --> 01:13:14,560 saying here's the mapping. 1578 01:13:14,560 --> 01:13:20,240 The problem with 52L pages is in a 4K byte page size, 52L 1579 01:13:20,240 --> 01:13:24,730 pages can only hold two megabytes of data. 1580 01:13:24,730 --> 01:13:27,640 If you're accessing more than two megabytes of data, you 1581 01:13:27,640 --> 01:13:30,100 have to go more than 52L pages, and at that point, that 1582 01:13:30,100 --> 01:13:34,010 mapping is not in the TLB, and that means that you have to go 1583 01:13:34,010 --> 01:13:35,640 do a mapping translation. 1584 01:13:35,640 --> 01:13:38,760 The problem for this is these 4K pages is a 1585 01:13:38,760 --> 01:13:42,440 very ancient artifact. 1586 01:13:42,440 --> 01:13:45,480 There's the new machines, you can I ask for what we call 1587 01:13:45,480 --> 01:13:48,195 large-pages or super, but you can ask for 2-megabyte or 1588 01:13:48,195 --> 01:13:49,620 4-megabyte pages. 1589 01:13:49,620 --> 01:13:53,220 If you ask for super-pages, then of course, 52L entry's 1590 01:13:53,220 --> 01:13:54,710 good enough. 1591 01:13:54,710 --> 01:13:58,550 It can hold enough things not to have TLB miss, at least 1592 01:13:58,550 --> 01:13:59,480 while you're in the cache. 1593 01:13:59,480 --> 01:14:02,520 But since Linux, without doing anything, only asks for 4K 1594 01:14:02,520 --> 01:14:06,230 pages, before you run out of your L3 cache, you run out of 1595 01:14:06,230 --> 01:14:08,140 the TLB table. 1596 01:14:08,140 --> 01:14:11,650 And it has to start fetching and [UNINTELLIGIBLE] 1597 01:14:11,650 --> 01:14:14,330 data table, that means you had to actually fetch the table 1598 01:14:14,330 --> 01:14:16,410 entries, and so you have to do more fetchings. 1599 01:14:16,410 --> 01:14:18,880 So that is why this is even running higher, because it's 1600 01:14:18,880 --> 01:14:22,370 trying to fetch this TLB entry and stuff like that. 1601 01:14:22,370 --> 01:14:22,830 Question? 1602 01:14:22,830 --> 01:14:25,508 AUDIENCE: Is there a way to then actually fiddle with the 1603 01:14:25,508 --> 01:14:29,170 TLB so that you have to fit your pages rather than-- 1604 01:14:29,170 --> 01:14:32,450 PROFESSOR: So in allocation, you can ask for super-pages or 1605 01:14:32,450 --> 01:14:34,400 large-pages. 1606 01:14:34,400 --> 01:14:38,520 But in Linux, normally just ask for 4K pages, so that the 1607 01:14:38,520 --> 01:14:40,390 operating systems haven't kind of caught up 1608 01:14:40,390 --> 01:14:42,050 to the current hardware. 1609 01:14:42,050 --> 01:14:43,560 And this is was built [UNINTELLIGIBLE] 1610 01:14:43,560 --> 01:14:45,860 doesn't make sense TLB to run out before 1611 01:14:45,860 --> 01:14:47,110 you run out the cache. 1612 01:14:47,110 --> 01:14:49,590 I mean, that's kind of mismatch in here. 1613 01:14:49,590 --> 01:14:52,480 But it's built assuming you'll get large-pages, but if you 1614 01:14:52,480 --> 01:14:54,780 don't, you're going to run out of the TLB in there. 1615 01:14:54,780 --> 01:14:58,440 So this is why sometimes performance engineering is 1616 01:14:58,440 --> 01:15:01,470 interesting, exciting, and frustrating, because you 1617 01:15:01,470 --> 01:15:03,210 assume you know where the hell this is going. 1618 01:15:03,210 --> 01:15:05,460 You assume this is going to go here and that, and suddenly, 1619 01:15:05,460 --> 01:15:08,240 this miss shows up, and so we have to scratch our heads. 1620 01:15:08,240 --> 01:15:11,820 And luckily, we found what's going on, but we could have 1621 01:15:11,820 --> 01:15:14,560 spent hours or days trying to figure out what's going on 1622 01:15:14,560 --> 01:15:18,480 here because this is one [UNINTELLIGIBLE] 1623 01:15:18,480 --> 01:15:19,900 say OK, let's try TLB, and it worked. 1624 01:15:19,900 --> 01:15:23,870 That was really good, but I have, many cases, spent days 1625 01:15:23,870 --> 01:15:27,450 trying to figure out why something is going on because 1626 01:15:27,450 --> 01:15:29,630 performance is kind of a [UNINTELLIGIBLE] 1627 01:15:29,630 --> 01:15:30,210 of everything. 1628 01:15:30,210 --> 01:15:32,300 There's no nice abstractions and stuff like that. 1629 01:15:32,300 --> 01:15:35,590 If you think abstraction wise, somewhere outside the 1630 01:15:35,590 --> 01:15:37,580 abstraction, that's going to kill you. 1631 01:15:37,580 --> 01:15:39,750 So that is why it can be interesting, because you have 1632 01:15:39,750 --> 01:15:41,870 to know everything. 1633 01:15:41,870 --> 01:15:45,030 And the fun thing is, once you understand performance, you 1634 01:15:45,030 --> 01:15:47,730 basically understand end-to-end, soup to nuts, 1635 01:15:47,730 --> 01:15:49,160 what's going on. 1636 01:15:49,160 --> 01:15:52,520 OK, with that encouraging thought, let's see how you 1637 01:15:52,520 --> 01:15:55,335 guys do in the next part of project two. 1638 01:15:55,335 --> 01:16:03,691